
Overview

Abstract

Every data processing pipeline and every administrative task is defined as part of a tenant. Within each tenant, you further organize your processing units into channels. A channel is a set of tasks, each defined as a punchline.

Channels may also contain administrative tasks, for example to handle your data lifecycle configuration, as well as third-party components such as logstash, or even your own components. Whatever mixture of punchlines or third-party components you group in a channel,

Architecture

The punch lets you define, orchestrate and run various applications using a very simple architecture. Here is the overall view of how the punch allows a user to submit different types of applications to various processing engines.

image

where:

  1. is a shared folder equipped with revision control capabilities. This repository contains all the platform and application configuration files.
  2. represents an administrative console for the sake of illustrating how a terminal user can start an application.
  3. are (typical) examples of applications launched and managed by the punch.
    • a storm punchline used for streaming use cases
    • a spark punchline used for batch or streaming analytics processing
    • a third-party application. An example is a logstash process, but it could also be your own application.

Configuration

Here is a typical punchplatform configuration folder layout.

└── conf
    ├── punchplatform.properties
    ├── resolv.hjson
    ├── resources
    │   ├── groks
    │   ├── mappings
    │   ├── models
    │   └── punchlets
    └── tenants
        ├── customer1
        │   ├── channel
        │   │   ├── admin
        │   │   │   ├── channel_structure.hjson
        │   │   │   └── housekeeping_punchline.hjson
        │   │   └── apache_httpd
        │   │       ├── archiving_punchline.hjson
        │   │       ├── channel_structure.hjson
        │   │       └── parsing_punchline.hjson
        │   └── etc
        │       └── conf.json
        └── platform
            ├── channel
            │   └── admin
            │       ├── channel_structure.hjson
            │       └── monitoring_punchline.hjson
            └── etc
                └── conf.json
  • conf refers to the root folder where all configuration files are stored. The location of that folder is defined in the $PUNCHPLATFORM_CONF_DIR environment variable.
  • punchplatform.properties is a property file that contains key runtime properties.
  • resolv.hjson is a property file used to resolve endpoint and security properties.
  • resources contains various shared resources you will need. Think of parsers, enrichment files, machine learning models, kibana generic dashboards.
  • tenants contains your per-tenant configuration. Remember everything is defined in a tenant.
    • the reserved platform tenant is used for platform-level applications. Some monitoring or housekeeping tasks are typically defined at that level. End users do not have access to this tenant; only the platform administrators do.
    • here the customer1 tenant is a fictitious end-user example.
    • each tenant has a few additional properties defined in the etc/conf.json file.
  • channels: inside each tenant, all applications are grouped into channels in whatever way you decide.
  • channel_structure.hjson: each channel (for instance 'admin' or 'apache_httpd') is defined using this file. It basically defines the channel content.
  • *_punchline.hjson: individual applications are defined by punchlines, one file per punchline.

In reality you can have more applications than sketched out here, including third-party apps. But this minimal description suffices for you to understand the punch.
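To make the layout concrete, here is a minimal Python sketch that walks such a configuration folder and lists the files of each channel, tenant by tenant. It is not a punch tool: the function name `list_channels` is hypothetical, and the sketch only assumes the `tenants/<tenant>/channel/<channel>/*.hjson` layout and the `$PUNCHPLATFORM_CONF_DIR` variable described above.

```python
import os
from pathlib import Path


def list_channels(conf_dir):
    """Return {tenant: {channel: [hjson file names]}} for a punch-style conf folder.

    Hypothetical helper: it only relies on the folder layout shown above.
    """
    layout = {}
    tenants_dir = Path(conf_dir) / "tenants"
    if not tenants_dir.is_dir():
        return layout  # nothing to list, e.g. the folder does not exist
    for tenant in sorted(p for p in tenants_dir.iterdir() if p.is_dir()):
        channels = {}
        channel_root = tenant / "channel"
        if channel_root.is_dir():
            for channel in sorted(p for p in channel_root.iterdir() if p.is_dir()):
                # channel_structure.hjson and the punchline files live side by side
                channels[channel.name] = sorted(f.name for f in channel.glob("*.hjson"))
        layout[tenant.name] = channels
    return layout


if __name__ == "__main__":
    # Fall back to a local "conf" folder when the variable is not set (assumption).
    conf_dir = os.environ.get("PUNCHPLATFORM_CONF_DIR", "conf")
    for tenant, channels in list_channels(conf_dir).items():
        for channel, files in channels.items():
            print(f"{tenant}/{channel}: {', '.join(files)}")
```

Run against the tree above, this would report two channels for customer1 (admin and apache_httpd) and one for the platform tenant.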

Tip

On a simple platform, this directory is the top-level root folder holding all the platform files. On production platforms, the configuration is stored on some backend and exposed through the REST punch gateway API.

A key benefit of the punch is to keep it extra simple: you essentially deal with the configuration files to fully define your tenants, channels and application scheduling policies. In the rest of this chapter we provide several examples of the right part so that you understand how applications are submitted, stopped and monitored.

Local Architecture

The standalone punch platform is a single-node setup that you can install and run in minutes. It is also a good example of a simple yet effective architecture that provides all the required production-grade services with a minimal footprint. Here is how the application control and monitoring plane is implemented on a standalone platform.

image

A punch requires a persistent store to keep track of running applications. With this simple setup the local filesystem is used. The administration store is also used to request application start or stop commands. A punch daemon called shiva is in charge of executing the applications.

The standalone punch also includes Elasticsearch, Kibana and other components. Refer to the getting started guide.

Distributed Architecture

A production punch requires a highly available administration store. Instead of using the local filesystem, kafka is used to provide the required storage engine. This is illustrated next:

image

  1. the configuration folder is on your local filesystem. It is located in the conf folder of the installation directory.
  2. you use your (linux|macos) terminal to issue administrative commands.
  3. the Kafka cluster is named local. Three topics are used :
    • (4) standalone-shiva-ctl for scheduling applications to shiva
    • (5) standalone-shiva-data for scheduling applications to shiva
    • (6) standalone-admin to retain the current application(s) status
  4. a shiva cluster : this is the punch proprietary application scheduler.
  5. a spark cluster to run spark and pyspark applications.
  6. a storm cluster to run streaming punch processing applications.

Platform Properties

Configuration management of data processing platforms is a difficult issue. Here is how the punch lets you grasp your complete configuration through a very simple set of properties.

The punchplatform.properties file holds the essential information required by all components to interact with each other. Here is an example from the standalone punchplatform.

{
    "platform" : {
        "platform_id" : "my-unique-platform-id",
        "admin" : {
            "type" : "kafka",
            "cluster" : "local",
            "topics" : {
                "admin_topic" : {
                    "name" : "standalone-admin"
                }
            }
        }
    },
    "kafka" : {
        "clusters" : {
            "local" : {
                "brokers_with_ids" : [ { "id" : 0, "broker" : "localhost:9092" } ]
            }
        }
    },
    "shiva" : {
        "clusters" : {
            "platform" : {
                "type" : "kafka",
                "cluster" : "local",
                "topics" : {
                    "control_topic" : {
                        "name" : "standalone-shiva-ctl"
                    },
                    "data_topic" : {
                        "name" : "standalone-shiva-data"
                    }
                }
            }
        }
    }
}
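The way these sections reference each other can be sketched in a few lines of Python: the admin store and each shiva cluster name a Kafka cluster (here local), and that name is looked up under kafka.clusters to find the brokers. This is only an illustration under stated assumptions: `resolve_topics` is a hypothetical helper, not a punch API, and the example is embedded as strict JSON, whereas a real properties file may use a more relaxed syntax that `json.loads` would reject.

```python
import json

# Trimmed copy of the example properties, written as strict JSON (assumption).
EXAMPLE = """
{
  "platform": {
    "platform_id": "my-unique-platform-id",
    "admin": {"type": "kafka", "cluster": "local",
              "topics": {"admin_topic": {"name": "standalone-admin"}}}
  },
  "kafka": {
    "clusters": {"local": {"brokers_with_ids": [{"id": 0, "broker": "localhost:9092"}]}}
  },
  "shiva": {
    "clusters": {"platform": {"type": "kafka", "cluster": "local",
                 "topics": {"control_topic": {"name": "standalone-shiva-ctl"},
                            "data_topic": {"name": "standalone-shiva-data"}}}}
  }
}
"""


def resolve_topics(properties):
    """Return {topic name: [broker addresses]} for the admin store and every shiva cluster."""
    kafka_clusters = properties["kafka"]["clusters"]

    def brokers(cluster_name):
        # Follow the "cluster" reference into the kafka.clusters section.
        return [b["broker"] for b in kafka_clusters[cluster_name]["brokers_with_ids"]]

    sections = [properties["platform"]["admin"]]
    sections.extend(properties["shiva"]["clusters"].values())

    resolved = {}
    for section in sections:
        for topic in section["topics"].values():
            resolved[topic["name"]] = brokers(section["cluster"])
    return resolved


if __name__ == "__main__":
    for name, addresses in resolve_topics(json.loads(EXAMPLE)).items():
        print(name, "->", ", ".join(addresses))
```

With the example above, the three topics (standalone-admin, standalone-shiva-ctl and standalone-shiva-data) all resolve to the single broker localhost:9092.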

Tip

The punchplatform is highly modular and lightweight. Here the example platform has only the internal punch application scheduler called shiva, which allows the execution of many simple and useful applications such as logstash, punch lightweight pipelines or your own python application. Of course you can configure it with more components such as a Spark or Storm engine, plus Elasticsearch. It all depends on your use case but the principles are the same.

When you start your platform, the following topics are created:

bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 \
    --create --topic standalone1-admin \
    --partitions 1 \
    --replication-factor 1 \
    --config "retention.bytes=104857600" \
    --config "cleanup.policy=compact" \
    --config "delete.retention.ms=86400000" \
    --config "segment.ms=604800000" \
    --config "min.cleanable.dirty.ratio=0.5"
bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 \
    --create --topic platform-applications \
    --partitions 1 \
    --replication-factor 1 \
    --config "cleanup.policy=compact" \
    --config "delete.retention.ms=100" \
    --config "segment.ms=100" \
    --config "min.cleanable.dirty.ratio=0.01"
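Both topics use cleanup.policy=compact. With that policy Kafka eventually keeps only the most recent record for each key, which is what makes a topic usable as a store for the current status of applications. The Python sketch below imitates the end result of compaction on an in-memory list. It is a simplification: real compaction runs asynchronously, segment by segment, and tombstones survive for delete.retention.ms before disappearing. The keys and values shown are hypothetical, since the actual record format used by the punch is not described here.

```python
def compact(records):
    """Keep only the most recent value for each key, like a fully compacted topic.

    `records` is a list of (key, value) pairs in log order. A value of None
    plays the role of a tombstone and removes the key altogether.
    """
    latest = {}
    for key, value in records:
        latest.pop(key, None)  # drop the older record so the newer one keeps log order
        if value is not None:
            latest[key] = value
    return list(latest.items())


if __name__ == "__main__":
    # Hypothetical status records keyed by tenant/channel.
    log = [
        ("customer1/apache_httpd", "started"),
        ("customer1/admin", "started"),
        ("customer1/apache_httpd", "stopped"),
        ("customer1/admin", None),  # tombstone: forget this channel
    ]
    print(compact(log))  # [('customer1/apache_httpd', 'stopped')]
```

A consumer that reads such a topic from the beginning therefore ends up with one record per key, the latest one, however many updates were published before.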