Admin workstations / operator workstations
The PunchPlatform admin/operator stations give users access to the PunchPlatform configuration files and operational commands. A Linux server or laptop can be installed with the punchplatform operator distribution to get access to the platform and, in turn, let authorized users configure channels.
An automated deployment and update tool. It is used to install and update punchplatforms.
"Backlog" means "events yet to come" or "work remaining to be done". In Kafka in particular, it refers to the data sent by a producer that has not yet been processed by the consumer(s).
This is a standard Apache Storm concept. Inside punchplatform we refer to a "storm-like" punchline node instead. A bolt is a Storm topology component in charge of processing the data. The bolt can send ("emit" in Storm terminology) the resulting data items further down the topology graph. A terminating bolt can push the data to an external component such as Kafka, Elasticsearch or any other next hop. The PunchPlatform provides ready-to-use Kafka, Elasticsearch and TCP/UDP/SSL socket output bolts, as well as several processing bolts.
A channel groups several topologies connected through Kafka queues. It defines an end-to-end transport and processing path up to one or several backends, each in charge of a specific role: archiving, searching, aggregating. The channel is the central user-managed PunchPlatform entity. It can be configured, started, stopped or reloaded. See Channels for more details.
Cybersecurity is the body of technologies (processes and practices) designed to protect data and IT from attacks, damage or unauthorized access. It is the IT part of security, which has to be treated like physical security. The aim of cybersecurity is to protect Confidentiality, Integrity, Availability and Proof of Information. Often reduced to electronic concerns, cybersecurity needs audits, enforcement, controls and monitoring. Security supervision relies on human supervision and on a SIEM (Security Information and Event Management), which analyzes the events emitted by security sources. In the context of cybersecurity, PunchPlatform focuses on these stakes.
Erasure coding is a mathematical technique used to make data storage resilient without requiring full data replication. It is implemented in some RAID variants (for example RAID5) and is available in Ceph as an option when configuring object storage pools.

The data to store is split into shards (for example 3 shards). These can be distributed over multiple storage nodes (here, 3), but this alone does not provide any additional resilience. The erasure coding algorithm then computes additional resilience shards (for example 2) of the same size, mathematically combining information from the 3 data shards. The key property is that if, among all the shards (here, 3 + 2), you lose some because of an incident, but no more than the number of "additional" shards, the initial data can still be reconstructed from the remaining shards. Here, it means that even with 2 servers down, the remaining 3 shards are enough to reconstruct the stored data.

Of course, computing the erasure-coding shards, or reconstructing data from multiple shards, consumes CPU that would not otherwise be required; this is a compromise that reduces overall storage consumption compared to fully replicated data storage. Note also that after the permanent loss of a server, restoring resilience on a new server requires recomputing all the shards for that server, which implies heavy CPU consumption. For a resilient punchplatform, a +2 erasure-coded resilience is advised, so that after the loss of a single server the system can still withstand another server loss while the data is being reconstructed to compensate for the first one.
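To make the principle concrete, here is a minimal Python sketch of the simplest possible erasure code: a single XOR parity shard computed over 3 data shards (a "3 + 1" layout, similar in spirit to RAID5). It is deliberately not the 3 + 2 scheme described above, which needs a more involved code such as Reed-Solomon, and it does not claim to reflect how Ceph implements erasure coding. All function names are illustrative.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_into_shards(data: bytes, k: int) -> List[bytes]:
    """Split data into k equally sized shards, zero-padding the last one."""
    shard_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_size * k, b"\0")
    return [padded[i * shard_size:(i + 1) * shard_size] for i in range(k)]


def parity_shard(shards: List[bytes]) -> bytes:
    """Compute the single resilience shard as the XOR of all data shards."""
    return reduce(xor_bytes, shards)


def rebuild_missing(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the one missing shard (marked None) from the surviving ones.

    Because the XOR of all data shards and the parity shard is zero,
    the XOR of the survivors is exactly the missing shard.
    """
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one lost shard")
    survivors = [shard for shard in shards if shard is not None]
    rebuilt = list(shards)
    rebuilt[missing[0]] = reduce(xor_bytes, survivors)
    return rebuilt


# Usage: 3 data shards + 1 parity shard, then simulate the loss of one node.
data_shards = split_into_shards(b"punchplatform log line", 3)
stored = data_shards + [parity_shard(data_shards)]
damaged = list(stored)
damaged[1] = None  # the node holding shard 1 is lost
recovered = rebuild_missing(damaged)
```

With this layout the storage cost is 4/3 of the original data size and one loss is tolerated; the 3 + 2 layout costs 5/3 and tolerates two losses, whereas tolerating two losses with plain replication would require three full copies.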
Git is a distributed, open-source version control system. Git repositories are web/filesystem containers holding all versions of a filesystem directory and its sub-elements. Git repositories can be synchronized by pull and push to allow consistent team work (cf. official Git documentation).
The operator command (through the command line or another git management tool) that saves, as a new version in the git data, the changes the user has made to the local copy of the files in a git working directory tree. Git commits can be listed by going to the local working directory tree and using the git log command; note that this only shows commits known to the local repository, and not those that may exist in another repository (such as a reference repository, or the local working directory tree of another user) but have not yet been "pulled" from it.
The operator command (through the command line or another git management tool) that updates a local git working directory tree by retrieving the recent changes (commits) that exist in a reference git repository.
The operator command (through the command line or another git management tool) that uploads to a reference git repository the recent changes (commits) that exist in the local git working directory tree.
In the Git change management system, version data is stored in file system directories. This internal Git data, which makes it possible to retrieve any previous version of the managed content along with its history, is called a repository. When multiple users work on the same content, at different physical places or on different version "branches", each user has a local repository associated with their local copy of the managed files. Repositories are synchronized only upon user request (cf. git pull and git push). When multiple users want a 'reference' repository into which all changes are stored, a dedicated bare repository is created for this purpose; "bare" means that no local "working copy" of the managed content is associated with the git repository (therefore any change can only come from a push by one of the users).
Housekeeping is the automatic processing performed by the PunchPlatform administration services in charge of archiving, reloading or destroying data, as requested by configuration. In particular, this feature allows the automatic removal of data that has become too old.
A high-velocity, resilient messaging service. Data is written into topics, which are themselves split into partitions for load sharing. Topics are managed by clustered brokers that handle replication and resilience.
Because of the "no acknowledgement" logic of Kafka (see Kafka partition), Kafka has no way to know the depth of the backlog of messages that have not yet been consumed/processed by the consumers of a topic. It is therefore up to the consumers themselves to determine how many messages they still have to consume. In PunchPlatform distributions, the delivered Kafka consumer bolt Storm component instances publish this backlog depth in the metrics flow sent to the Elasticsearch server (see MonitoringGuide), which is displayed in Kibana dashboards.
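As an illustration of how a consumer can work out its own backlog, the Python sketch below subtracts, for each partition, the offset the consumer has reached from the offset of the latest message written. It uses plain dictionaries rather than a Kafka client, and the function and parameter names are assumptions made for this example, not PunchPlatform or Kafka APIs.

```python
from typing import Dict


def backlog_per_partition(
    end_offsets: Dict[int, int],
    consumed_offsets: Dict[int, int],
) -> Dict[int, int]:
    """Return the number of messages still to be consumed in each partition.

    end_offsets: for each partition, the offset the next written message will get.
    consumed_offsets: for each partition, the offset of the next message the
    consumer will read. A partition missing from this mapping is treated as
    never read, starting from offset 0 (a simplifying assumption that ignores
    retention and reset policies).
    """
    return {
        partition: max(end - consumed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def total_backlog(end_offsets: Dict[int, int], consumed_offsets: Dict[int, int]) -> int:
    """Backlog depth of the whole topic for this consumer."""
    return sum(backlog_per_partition(end_offsets, consumed_offsets).values())


# Usage: a 3-partition topic where partition 2 has never been read.
per_partition = backlog_per_partition({0: 120, 1: 80, 2: 50}, {0: 100, 1: 80})
```

A value such as the one returned by total_backlog is the kind of figure a consumer could publish as a metric.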
The closest equivalent of a "topic" in other messaging systems is a "queue", but because Kafka does not manage data in queues (see partitions for an explanation), another term was chosen. The idea is nevertheless similar in the following respects: the data going through a topic is stored separately from other topics, and has dedicated retention settings. The storage quota allocated to a topic may be exhausted without direct impact on other topics.
To achieve high speed and availability, Kafka does not manage any acknowledgement from the data consumers that read data from the queues. Therefore:
- consumers are responsible for the persistent and consistent management of the "offset", which indicates the last message processed from the data provided by Kafka
- consumers may re-read any message using its offset, regardless of previous reads, as long as Kafka has not destroyed the message
- Kafka manages message retention through a duration and/or size policy, regardless of actual processing by the consumers.

With these principles, it is easier for load-sharing Kafka consumers to avoid sharing the same event flow, because otherwise they would have to synchronize their consumption offsets and agree on a common work-splitting convention.
For that purpose, Kafka provides the "partition" concept, which is a "shard" of the topic messages. Message writers are responsible for distributing their messages between the topic partitions registered in Kafka (for example, using a 'round robin' strategy); consumers are responsible for handling one or more partitions of the topic, without sharing any partition with another consumer of the group (this level of synchronization between workers is usually achieved through Zookeeper).
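The two responsibilities described above can be sketched in a few lines of Python: a writer spreading messages over partitions in round-robin order, and a group of consumers splitting the partitions so that none is shared. This is an illustration with hypothetical function names, not the partitioner or group assignment logic of an actual Kafka client.

```python
from typing import Dict, List, Sequence


def round_robin_partitions(messages: Sequence[str], partition_count: int) -> Dict[int, List[str]]:
    """Writer side: distribute messages over the partitions in round-robin order."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(partition_count)}
    for index, message in enumerate(messages):
        partitions[index % partition_count].append(message)
    return partitions


def assign_partitions(partition_count: int, consumers: Sequence[str]) -> Dict[str, List[int]]:
    """Consumer side: give every partition to exactly one consumer of the group."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(partition_count):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Usage: 5 messages over 3 partitions, read by a group of 2 consumers.
written = round_robin_partitions(["a", "b", "c", "d", "e"], 3)
owners = assign_partitions(3, ["consumer-1", "consumer-2"])
```

Since each partition has a single owner within the group, each consumer only has to track its own offsets and no offset synchronization between consumers is needed.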
Kibana is a product from the Elastic galaxy. It provides a web portal that can read the contents of an Elasticsearch server, with beautifully designed visualizations. To see how powerful Kibana is, a good starting point is our Kibana Guide.
The punchplatform provides its own lightweight Storm engine. Just like local topologies, it runs a topology in a single Java process, i.e. it does not support multi-worker topologies. It is used on small systems and collectors to reduce the solution footprint.
A PunchPlatform configured to operate as a Log Management solution is referred to as an LMC (Log Management Center). The PunchPlatform product team provides some standard resources for the initial configuration of an LMC, although each specific LMC instance may in fact have its own custom resources, parsers, etc., and is therefore not a standardized setup.
LMR stands for Log Management Receiver. It is a punchplatform configuration in charge of receiving logs from remote forwarders (LTRs). An LMR simply forwards the logs to an LMC.
Instead of submitting a storm/storm-like punchline to the Storm or shiva cluster, it is also possible to run it embedded in a single Java virtual machine. This execution mode is mainly used for testing and development. Punchlines run that way are called local punchlines. Local punchlines or topologies are not visible to the Storm cluster and its monitoring server/UI, nor to the shiva leader and management topics.
An event generated by a piece of IT equipment, stored or sent in text format.
In a Cybersecurity context, Log Management, or Security Information Management (SIM), is the act of collecting, analyzing, storing and querying events emitted by IT systems (network equipment, Intrusion Detection Systems, other SIEMs, servers, SCADA...), often in
LTR stands for Log Transporter. It is a punchplatform configuration that forwards logs from a remote site to an LMC or LMR.
Scalable storage systems have introduced a way to store data with less of a "central" bottleneck caused by the consistency management and indexing that file systems need. In these storage frameworks, "files" have therefore been renamed "objects", because you cannot read or write them "partially", as you would when writing or reading a file one line at a time. Objects can only be written or read as a whole, using their unique identifier. You usually cannot use a "directory structure" to sort them, although some storage division is usually provided (Amazon S3 "buckets", Ceph "pools"...) for the sake of isolation or differentiated access control. Examples of object storage are Amazon S3, Ceph RADOS and OpenStack Swift. All of these provide access libraries and a RESTful web service API. The Ceph implementation has been chosen for PunchPlatform cluster deployments, but the punchplatform components that read/write in Ceph are designed to evolve easily towards reading/writing to other object storage frameworks.
Parsing is a process used to interface two systems that exchange data in different formats. The act of "parsing" consists of analyzing the syntax of incoming raw data (often in human-readable format) and placing the resulting elements in normalized fields. Although often related to Log Management, parsing also applies to a whole set of other use cases: metrics, statistics, etc.
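As a small illustration of this definition, the sketch below parses a raw, human-readable log line into normalized fields. It is written in plain Python rather than in the punch language, and both the log layout and the field names (timestamp, host, program, pid, message) are assumptions chosen for the example.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<timestamp> <host> <program>[<pid>]: <message>", pid optional.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) "
    r"(?P<program>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$"
)


def parse_log(raw: str) -> Optional[Dict[str, str]]:
    """Return the normalized fields of a raw log line, or None if it does not match."""
    match = LOG_PATTERN.match(raw)
    if match is None:
        return None
    # Drop optional fields that are absent from this particular line.
    return {name: value for name, value in match.groupdict().items() if value is not None}


# Usage
fields = parse_log("2024-01-15T10:32:00Z web01 sshd[4242]: Failed password for root")
```

Lines that do not match the expected syntax yield None, so that a caller can route them to a separate error flow instead of silently dropping them.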
A Plan defines how a Spark Punchline is periodically run. For instance, you may want to train your pipeline on one hour of data and use it to make predictions on the next hour of data. A plan is composed of several configuration files: one is a (jinja) punchline template containing variable elements, the other is a plan schedule configuration file that sets these variables to the required values. Typical configuration values are the start and stop of the training or prediction stages. Cron expressions are used to define repetitions. A Plan takes care of persisting the last successful schedule to guarantee at least once execution. A Standalone Quickstart demonstrates the concept in action.
A programming language to act on JSON documents using a compact and intuitive syntax. To learn more about it, start with this guide.
This is a standard Apache Storm concept. Inside punchplatform we refer to "storm-like" punchline punch node instead.
A Storm bolt in charge of executing punchlets. Punchlets make it extra easy to process JSON data, as well as to act on Storm streams and fields. See the Punch Bolt documentation for a full review.
A Punch program deployed in a Storm topology (using the Punch bolt).
It processes JSON documents on the fly as they traverse the topology.
A small program written using the punch language and deployed in a punchplatform channel. A punchlet is somewhat equivalent to a servlet.
PunchPlatform Machine Learning (PML)
The PunchPlatform Machine Learning (PML) SDK lets you design and run arbitrary Spark applications on your platform data. Why would you do that? Punchplatforms process large quantities of data, which are parsed, normalized, then indexed in Elasticsearch in real time. This gives you powerful search capabilities through Kibana and/or Grafana dashboards. Spark applications help you extract additional value from your data, from computing statistical indicators to running learn-then-predict or learn-then-detect applications. The scope of application is extremely wide; it all depends on your data.

Doing that on your own is however not that easy. First you have to code and build your Spark pipeline application, taking care of many configuration issues such as selecting the input and output data sources. Once ready, you have to deploy it and run it in cycles of train-then-detect/predict rounds, on enough real data to evaluate whether it outputs interesting findings. In many cases you operate on a production system where the real data resides, which makes it risky if you do not control the resources needed by your application. In short: not that easy.

The goal of PML is to make all that much simpler and safer. In a nutshell, PML lets you configure arbitrary Spark pipelines. Instead of writing code, you define a few configuration files to select your input and output data and to fully describe your Spark pipeline. You also specify the complete cycle of execution, for example: train on every last day of data, and detect on today's live data. That is it. You submit that to the platform and it will be scheduled accordingly.
This file provides the PunchPlatform runtime command-line operator tools with the necessary cluster configuration, including:
- the hosts and ports of all the PunchPlatform dependencies (i.e. Storm cluster, Zookeeper cluster[s], Elasticsearch cluster[s])
- the Zookeeper and metrics root paths
This file is 'almost' in JSON format: JSON does not support comments, whereas you can use #-prefixed lines to include comments in your PunchPlatform properties file. Everything else is pure JSON.
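For tools that need to read such a file outside of the PunchPlatform commands, one possible approach is to drop the comment lines and hand the rest to a standard JSON parser. The Python sketch below does this under two assumptions: a comment is a whole line whose first non-blank character is #, and the keys in the sample are hypothetical placeholders, not the real punchplatform.properties layout.

```python
import json
from typing import Any, Dict


def load_commented_json(text: str) -> Dict[str, Any]:
    """Parse JSON text in which whole lines starting with '#' are comments.

    Assumption: leading whitespace before '#' is tolerated. JSON strings cannot
    span several lines, so a line starting with '#' is never part of a value.
    """
    kept_lines = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("#")
    ]
    return json.loads("\n".join(kept_lines))


# Usage with a made-up sample (keys are hypothetical, for illustration only).
sample = """
# hosts and ports of a dependency
{
  "zookeeper": {"hosts": ["localhost"], "port": 2181}
}
"""
settings = load_commented_json(sample)
```

Note that a # appearing after content on the same line is not treated as a comment by this sketch, in line with the "#-prefixed lines" wording above.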
This file is generated by the deployment tool, from the punchplatform-deployment.settings file. Manual changes should not be applied to the punchplatform.properties file.
In an operator environment, this file can be located through
To scale up, Storm actually distributes the topology components over several processes (workers) spread over several servers. This is the standard Storm strategy. We call the resulting topologies remote, because they are submitted to a Storm cluster.
In contrast to a channel, a punchplatform service is a task run with a cron-like approach. These tasks are generally started in a shiva cluster. Task examples: updating an Elasticsearch mapping, running a housekeeping function, etc.
Security Information and Event Management, made of SIM (see also Log Management) and SEM.
This term refers to an unwanted situation in a distributed system, which may occur when a network failure partitions the system and two (or more) separate subsets continue to process data on their own, each one convinced that it is "in charge" of the reference data. Data inconsistencies may then appear when the system is reunited after the network disruption has been repaired. To avoid these inconsistencies, clustered applications are designed to handle or avoid the split-brain situation. For example, Zookeeper clusters cannot fall into the split-brain trap because of a built-in restriction: only a group of strictly more than half of the cluster nodes can operate. Any smaller group of cluster nodes stops handling requests. This explains why it is not possible to have a resilient Zookeeper cluster on fewer than 3 physical servers.
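The "strictly more than half" rule can be checked with a little arithmetic. The Python sketch below (function names are illustrative) computes the minimum group size allowed to operate and the number of node losses a cluster can survive, which shows why 3 servers is the smallest resilient size.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest group that is strictly more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can be lost while a quorum still remains."""
    return cluster_size - quorum_size(cluster_size)


def can_operate(reachable_nodes: int, cluster_size: int) -> bool:
    """True if a group of mutually reachable nodes is allowed to handle requests."""
    return reachable_nodes >= quorum_size(cluster_size)


# Usage: a 5-node cluster cut in two by a network failure (3 nodes vs 2 nodes).
majority_side_runs = can_operate(3, 5)
minority_side_runs = can_operate(2, 5)
```

A 2-node cluster has a quorum of 2 and therefore tolerates no failure, while a 3-node cluster has a quorum of 2 and tolerates one. In any partition, at most one side can hold strictly more than half of the nodes, so two sides can never operate at the same time.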
A spout is a Storm topology component in charge of ingesting data into the topology. The PunchPlatform provides ready to use Kafka and TCP/UDP/SSL socket spouts.
A stage is a Spark pipeline step. Refer to the Spark MLlib documentation.
An all-in-one PunchPlatform zip archive that includes all the required dependencies (Elasticsearch, Zookeeper, Kafka, Kibana, Storm) needed to set up and run a standalone configuration. It runs in minutes and is supported on most Unix variants.
A demonstration/test configuration where a PunchPlatform cluster runs on a single server, including all its dependencies (Elasticsearch, Zookeeper, Kafka, Kibana, Storm), all configured to run on the same node (without cluster or replication configuration). A standalone configuration can be installed without user privileges using the PunchPlatform install.sh (see man page).
This is a standard Apache Storm concept. Inside punchplatform we refer to a "storm-like punchline" instead.
Storm is an Apache open-source execution platform for running distributed, resilient real-time flow processing chains. Each such processing chain is called a topology: basically a chain of data fetching components, called spouts, and of processing/output components, called bolts. This is Storm terminology; please refer to the many excellent public Storm online docs. For an overview of the topology lifecycle, see http://storm.apache.org/documentation/Lifecycle-of-a-topology.html.