Channels

Definition

This chapter explains how to define and configure a channel to process documents.

A channel is an end-to-end setup that transports documents from an entry point to one or several backends.

For example, a complex log channel can chain three processing layers:

  • First: logs are received by a syslog TCP listener process that forwards them to a first Kafka cluster, so as to save them quickly. Such a forwarder is implemented as a Storm topology chaining a syslog spout to a Kafka bolt.
  • Second: a topology takes the logs from that first Kafka cluster, parses and transforms them on the fly, and pushes them into a second Kafka cluster.
  • Last: a topology takes the parsed logs and pushes them into an Elasticsearch cluster.

Such a (typical) log channel is illustrated next.

[Figure: a typical log channel (ChannelExample1.png)]

Note

Channels may also be configured not as an end-to-end processing chain for a given log type, but as part of a more complex multi-channel structure, where a log is handled successively by several channels before reaching the final backend. A typical setup, for example, is to have multiple “input/filter/processing” channels dedicated to specific log types and/or input points, plus a single mutualized “output” channel that handles injection into multiple Elasticsearch clusters for all the log types coming from the first ones. The point of this more complex structure is to mutualize processes and reduce the overall memory usage.

In contrast, a simpler channel can go straight from a syslog TCP socket to an Elasticsearch backend, without going through intermediate Kafka brokers.
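For such a direct channel, the channel_structure.json (documented below) reduces to a single topology entry and typically needs no kafka_topics section. Here is a minimal sketch, using only the keys documented later in this chapter; the topology file name is hypothetical:

{
  "topologies" : [
    {
      "topology" : "syslog_to_elasticsearch_topology.json",
      "execution_mode" : "cluster",
      "cluster" : "main",
      "reload_action" : "kill_then_start"
    }
  ]
}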

Because it generally involves several distinct parts, a channel configuration consists of several files, all grouped under a common directory, $PUNCHPLATFORM_CONF_DIR/tenants. As an example, here is the layout of the configuration directory of a standalone PunchPlatform installation, including one of the sample channels.

punchplatform.properties
   /** main platform configuration file **/

resources
    elastalert
          /** rules of elastalert **/
    elasticsearch
          /** elasticsearch mapping templates **/
    grafana
          /** datasources and dashboards for grafana UI **/
    injector
          /** log injector to simulate logs **/
    kibana
          /** kibana dashboards **/
    punch
          /** the punchlets and associated resource files **/

templates
    lmc
        /** channel structure and topologies templates for lmc **/

tenants
    mytenant
        channels
            apache_httpd
                channel_structure.json
                single_topology.json
        configurations_for_channels_generation
            lmc
                /** lmc channel definition files such as apache_httpd.json **/
        etc
            /** tenant configuration directory **/

Before explaining the content of each file, here is the big picture:

  • mytenant is the name of the tenant. A PunchPlatform can process logs for many tenants in complete isolation. Tenant names are user defined; these have been chosen in the demo setup for illustrative purposes.
  • apache_httpd is the name of a channel: i.e. all apache logs for a tenant will be handled by this channel (assuming, of course, that all such apache logs can be sent to that particular channel). This name is user defined.
  • the resources/punch directory contains the platform punchlets. You can define punchlets for all tenants, as in this example, or only for a tenant or a channel. The PunchPlatform comes with a complete set of example parsers, organized in several sub-folders.

In a given channel directory, you find a channel_structure.json file and one or several topology files.

  • channel_structure.json : defines the structure of the channel. It specifies that the channel starts from (say) a syslog TCP socket, then goes through Kafka brokers and Storm topologies, up to some Kafka broker or Elasticsearch backend(s).
    • channel_structure.json files are used by PunchPlatform commands to set up, start or stop the various parts composing a channel.
    • it also contains autotest input and output points, used by the platform to automatically test the status of the channel.
  • single_topology.json, lmr_in_topology.json, processing_topology.json … : these are topology definition files. Each contains the configuration required to completely instantiate a Storm, Spark or Flink job. Their names are user defined; a minimal sketch of such a file is shown after this list.
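As an illustration, a topology definition file for the direct syslog-to-Elasticsearch case declares an input component (a spout) and one or several processing/output components (bolts), and wires them together. The exact component types and settings keys depend on your PunchPlatform release; the sketch below only conveys the overall shape, and every name in it should be taken as hypothetical:

{
  "spouts" : [
    {
      "type" : "syslog_spout",
      "spout_settings" : {
        "listen" : { "proto" : "tcp", "host" : "0.0.0.0", "port" : 9901 }
      },
      "storm_settings" : { "component" : "syslog_input" }
    }
  ],
  "bolts" : [
    {
      "type" : "elasticsearch_bolt",
      "bolt_settings" : { "cluster_id" : "es_search" },
      "storm_settings" : {
        "component" : "elasticsearch_output",
        "subscribe" : [ { "component" : "syslog_input" } ]
      }
    }
  ]
}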

Because most if not all of your channels will be similar, and will ultimately depend on your architecture, the PunchPlatform lets you generate these JSON files (channel_structure and topologies) from templates. In the rest of this chapter we explain one configuration file, the one describing the overall structure of a channel.

Channel Structure

Storm topologies

The channel_structure.json file of each channel has the following content:

{
  "topologies" : [
    {
      "topology" : "lmr_in_topology.json",
      "execution_mode" : "cluster",
      "cluster" : "main",
      "reload_action" : "none"
    },
    ...
  ],

  "autotest_latency_control" : {
    "path_controls" : {
      "syslog_to_elasticsearch" : {
        "input" : { ... },
        "output" : { ... },
        "warn_threshold_s" : 5,
        "error_threshold_s" : 10
      },
      ...
    }
  },

  "metrics" : {
    "metrics_source" : { ... }
  },

  "kafka_topics" : {
    "output_topic": {
      "name" : "mytenant_arkoon_output",
      "partitions" : 4,
      "kafka_cluster" : "local",
      "replication_factor" : 1
    },
    ...
  }
}

The topologies[] array lists the one or several Storm topologies composing the channel. Each entry has the following properties:

  • topology : refers to a topology definition file located in the channel configuration directory.
  • cluster : the logical name of the Storm cluster the topology is submitted to. A corresponding entry must appear in the $PUNCHPLATFORM_CONF_DIR/punchplatform.properties file.
  • execution_mode : “local” or “cluster”. The local mode runs the topology embedded in a single process. The cluster mode submits the topology to the Storm cluster. Note that local topologies do not scale.
  • reload_action : one of “none” or “kill_then_start”. Using “none” makes the topology unaffected by a channel reload order; this is typically used for topologies that must stay up and running across reconfigurations to ensure service availability. Using “kill_then_start” makes the topology killed and restarted upon reload. An example contrasting the two is shown after this list.
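For instance, a channel whose input topology must keep receiving logs across reloads, while its processing topology is restarted (say, to pick up updated punchlets), could declare its topologies[] array as follows (a sketch; the file names are hypothetical):

  "topologies" : [
    {
      "topology" : "input_topology.json",
      "execution_mode" : "cluster",
      "cluster" : "main",
      "reload_action" : "none"
    },
    {
      "topology" : "processing_topology.json",
      "execution_mode" : "cluster",
      "cluster" : "main",
      "reload_action" : "kill_then_start"
    }
  ]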

The kafka_topics dictionary lists the Kafka topics required by the channel. The key associated with each section logically identifies a topic, so that it can be referred to elsewhere in configuration files; this name is not related to anything inside Kafka. Each topic has the following properties:

  • name : the name of the topic as identified in Kafka.
  • kafka_cluster : the logical Kafka cluster name. A corresponding entry must appear in the $PUNCHPLATFORM_CONF_DIR/punchplatform.properties file.
  • partitions : the number of partitions for this topic. The more partitions, the more scalable your channel.
  • replication_factor : the number of replicas for each partition. A value of 2 is the minimum to achieve high availability. A production-oriented example is sketched after this list.
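For instance, a topic meant for a highly available production setup could be declared as follows (a sketch; the topic name, cluster name and sizing are hypothetical and must match your platform):

  "kafka_topics" : {
    "output_topic" : {
      "name" : "mytenant_apache_httpd_output",
      "partitions" : 8,
      "kafka_cluster" : "front",
      "replication_factor" : 2
    }
  }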

The autotest_latency_control and metrics sections define applicative monitoring, so that the channel health is constantly checked using latency indicators between any two arbitrary components. This is explained in detail in the monitoring chapter.

Light topologies

The configuration used for the Storm topology runtime is also valid for light topologies. The only difference is that the topologies key is replaced by shiva_topologies. For example, the following configuration starts the lmr_in_topology in light mode, based on Shiva.

{
  "shiva_topologies" : [
    {
      "topology" : "lmr_in_topology.json",
      "shiva_runner_tags" : ["blue", "red"]
    },
    ...
  ],

  "autotest_latency_control" : {
    ...
  },

  ...
}