Log Collector (LTR)¶
Abstract
This chapter explains the deployment of a resilient Punchplatform Log Collector. In older documentation it was called a Log Transport Receiver (LTR).
One Node Setup¶
- The collector is deployed on a single physical server.
- If log buffering is required, a single-node Kafka broker is also deployed.
- A single-node Shiva cluster is in charge of starting the punchlines.
- Punchlines are in charge of:
- receiving the traffic from the collected zone and saving it into Kafka.
- consuming Kafka and forwarding the traffic to the Punch Log Central (a minimal sketch is given after this list).
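To make this concrete, here is a minimal sketch of the input punchline for a single-node collector, condensed from the full reference examples later in this chapter (tenant, host, port and topic names are illustrative): it listens for syslog traffic and stores the raw logs into the local Kafka retention topic. A second punchline, similar to the forwarder example below, then consumes that topic and ships the logs to the Log Central.
# Minimal single-node input punchline (sketch, illustrative names)
version: "6.0"
type: punchline
name: input
tenant: mytenant
channel: collector
runtime: shiva
dag:
  # receive syslog traffic from the collected zone
  - type: syslog_input
    component: syslog_input
    settings:
      listen:
        proto: tcp
        host: collector1   # illustrative host name
        port: 9999
    publish:
      - stream: logs
        fields:
          - log
          - _ppf_id
          - _ppf_timestamp
  # save raw logs into the local Kafka retention topic
  - type: kafka_output
    settings:
      topic: mytenant-collector-input   # illustrative topic name
      encoding: lumberjack
    subscribe:
      - stream: logs
        component: syslog_input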
Three Nodes Setup (Highly Available)¶
- The collector must be deployed on 3 underlying physical servers.
- Kafka, Zookeeper and Shiva clusters are deployed on 3 VMs or containers.
- Punchlines are in charge of:
- receiving the traffic from the collected zone and saving it into Kafka.
- consuming Kafka and forwarding the traffic to the Log Central.
The multi-node setup makes it possible to:
- listen for incoming logs on multiple servers at the same time, providing high availability of the input point through a classical system-level 'Virtual IP' management service clustered on these servers. One listening punchline is located on each input server of the collector site.
- have replicated retention data, with copies of each data fragment on 2 of the 3 servers (replication is handled natively by Kafka; see the topic excerpt just below)
- provide availability of internal processing (retention, forwarding) through:
- inbuilt Kafka resilience to single-node failure
- the capability of the Shiva cluster to restart tasks (e.g. the forwarding punchline) on other nodes in case of node failure
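For instance, the retention topic declared in the channel structure example below uses a replication factor of 2, so each data fragment exists on two of the three Kafka brokers:
resources:
  - type: kafka_topic
    name: reftenant-collector-input
    partitions: 1
    replication_factor: 2   # original + 1 copy: survives a single broker failure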
The 3-node Zookeeper cluster ensures data integrity and availability even in case of network partitioning (one node being separated from the others): the cluster can rely on a strict majority to decide on the data status.
This makes self-restoration of the collection service and data retention more reliable when a faulty node is repaired and rejoins the two surviving nodes. A physical 2-server setup is likely to cause merge conflicts, and would therefore imply more manual operations to handle some incidents, and more potential data loss.
Except the "input" punchline, that is running once for each input server, the other processing and forwarding punchlines can run in only 1 instance. They are scheduled automatically to a random node by Shiva Cluster, and respawned elsewhere in case of node failure.
Warning
A legacy architecture based on only two nodes is not advised: as for any distributed system, two nodes cannot ensure data integrity in case of network partitioning. This makes data replication impractical (it is not possible to have a resilient Kafka cluster with data replication).
Having two independent nodes that only share a 'virtual IP' cluster is therefore a rough way to provide high availability of service; it may require much more human effort for incident management, or cause data loss in case of hardware failure while retention is in effect (i.e. not all data has been transmitted to the central site).
Key design highlights¶
Metadata gathering at entry point¶
Logs received from the network are enriched with associated metadata at the entry node of the collector punchline.
This is useful to track where the log network frame came from (_ppf_remote_host), what the reception port was (_ppf_remote_port), and what the reception timestamp was (_ppf_timestamp).
By assigning a unique internal id (_ppf_id) to each log at the entry point, this id can later be reused for deduplication purposes in case part of the flow is 'replayed' after an incident or communication instability.
See the syslog input node in the Collector site input punchline example below for a reference configuration of the metadata published with the log flow.
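As an illustration, a log record published by the syslog input node carries fields such as the ones below (all values, including the timestamp representation, are made up for the example):
{
  "log": "<134>Jan 12 08:33:12 fw01 apache_httpd: GET /index.html 200",
  "_ppf_local_host": "10.10.0.1",
  "_ppf_local_port": 9999,
  "_ppf_remote_host": "192.168.1.27",
  "_ppf_remote_port": 51412,
  "_ppf_timestamp": "2020-08-31T08:33:12.123+02:00",
  "_ppf_id": "illustrative-unique-id-0001"
}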
Virtual IP addresses for High Availability of logs reception¶
To ensure high availability of the log input listener ports, there are 2 reference patterns:
- Smart log senders can be configured with 2 target IP addresses, so no VIP is needed. The sender switches to the other receiver node if the TCP connection cannot be established.
- Many log senders can only be configured with 1 target IP address, though. In that case, the listening input punchline of a remote collection site runs on multiple input servers, and a unique IP address is chosen for this cluster of servers as the log sender target. This 'Virtual IP' is used by only one server at a time, through a cluster of daemons (pacemaker/corosync, keepalived, ...) that communicate with each other and ensure that the IP is published exactly once at any time.
Important
The Virtual IP cluster of daemons must be configured to place the Virtual IP only on a server where the listening port is indeed active. This makes it possible to cope with software failure or maintenance: if the input punchline is stopped on one server, the Virtual IP is expected to be placed automatically on the other (normally working) input punchline server, even though the server itself is not down.
In both cases, a listening punchline instance must run on each of the input servers. To achieve this, each punchline instance is pinned to a specific host using a Shiva 'tags' constraint, as shown in the excerpt below. See the Collector site channel structure example for the full configuration.
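For example, in the channel structure example below, the first input application is pinned to its host through the shiva_runner_tags entry:
- name: input1
  runtime: shiva
  cluster: shiva_collector
  shiva_runner_tags:
    - collector1   # this instance is always scheduled on the 'collector1' runner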
Remote platform monitoring¶
Collection sites collect events and metrics, and forward them to the central site for centralized monitoring of collection sites. Please refer to the Platform logs management/monitoring applications and queuing overview reference architecture documentation and to the reference configuration example of the monitoring channel below.
Reference configuration example¶
Collector site collection/forwarding channel structure example¶
This is an example of a collection site with 2 nodes, hence 2 instances of each input punchline. Only 1 instance of the forwarding punchline is needed for high availability, thanks to the clustered nature of the Shiva scheduler: it ensures that the forwarding topology is restarted on another node of the cluster in case of server failure.
In this example, a single channel handles multiple types of incoming events (see the punchline example below).
# This channel receives logs on multiple ports of the collector.
# It stores them in a retention Kafka topic and enriches the logs with equipment information,
# before forwarding them to a Central site for processing/storage/indexing.
version: "6.0"
start_by_tenant: true
stop_by_tenant: true
applications:
# input on collector1
- name: input1
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- collector1
command: punchlinectl
args:
- start
- --punchline
- input.yaml
- --childopts
- -Xms96m -Xmx96m # will override topology.component.resources.onheap.memory.mb
# input on collector2
- name: input2
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- collector2
command: punchlinectl
args:
- start
- --punchline
- input.yaml
- --childopts
- -Xms96m -Xmx96m
# parser to add equipment info
- name: parser
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- shiva_collector
command: punchlinectl
args:
- start
- --punchline
- parser.yaml
- --childopts
- -Xms96m -Xmx96m
# forwarder to central site
- name: forwarder
runtime: shiva
cluster: shiva_collector
command: punchlinectl
shiva_runner_tags:
- shiva_collector
args:
- start
- --punchline
- forwarder.yaml
- --childopts
- -Xms96m -Xmx96m
# store to local filesystem
- name: to_flat_storage
runtime: shiva
cluster: shiva_collector
command: punchlinectl
args:
- start
- --punchline
- to_flat_storage.yaml
- --childopts
- -Xms96m -Xmx96m
# Kafka topics required for this channel
resources:
- type: kafka_topic
name: reftenant-collector-input
partitions: 1
replication_factor: 2
- type: kafka_topic
name: reftenant-parser-output
partitions: 1
replication_factor: 2
Collector site input punchline example¶
This is an example of a receiver punchline listening on 3 different TCP ports for incoming logs. The hypothesis here is that:
- port 9999 will receive syslog logs from apache_httpd devices
- port 8888 will receive syslog logs from winlogbeat devices
- port 7777 will receive syslog logs from linux devices
The idea here is to multiplex all these logs into a single Kafka retention queue. Later, the Log Central site will dispatch the logs to the appropriate processing topology based on the port number on which each log was received.
# Listens to 3 different subsystems and stores raw logs into Kafka.
# It will be launched on multiple nodes for resiliency and performance.
version: "6.0"
type: punchline
name: input
tenant: reftenant
channel: multi_subsystems
runtime: shiva
dag:
# Syslog input subsystem1 (apache)
- type: syslog_input
component: subsystem1_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 9999
self_monitoring.activation: true
self_monitoring.period: 60
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Syslog input subsystem2 (winlogbeat)
- type: syslog_input
component: subsystem2_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 8888
self_monitoring.activation: true
self_monitoring.period: 10
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Syslog input subsystem3 (linux)
- type: syslog_input
component: subsystem3_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 7777
self_monitoring.activation: true
self_monitoring.period: 10
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Kafka output
- type: kafka_output
settings:
topic: reftenant-collector-input
encoding: lumberjack
producer.acks: "1"
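# "1": the write is acknowledged by the Kafka partition leader only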
subscribe:
- stream: logs
component: subsystem1_syslog_input
- stream: _ppf_metrics
component: subsystem1_syslog_input
- stream: logs
component: subsystem2_syslog_input
- stream: _ppf_metrics
component: subsystem2_syslog_input
- stream: logs
component: subsystem3_syslog_input
- stream: _ppf_metrics
component: subsystem3_syslog_input
metrics:
reporters:
- type: kafka
reporting_interval: 60
settings:
topology.max.spout.pending: 2000
topology.enable.message.timeouts: true
topology.message.timeout.secs: 30
Collector site collected events forwarder punchline example¶
This is an example of a punchline forwarding to a single central site, with two reception servers on the central site for high availability.
# Forwards parsed logs to central site.
version: "6.0"
tenant: reftenant
channel: collector
runtime: shiva
type: punchline
name: forwarder
dag:
# Kafka input
- type: kafka_input
component: kafka_input
settings:
topic: reftenant-parser-output
encoding: lumberjack
start_offset_strategy: last_committed
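# resume from the last committed consumer offset after a restart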
self_monitoring.activation: true
self_monitoring.period: 60
fail_action: sleep
fail_sleep_ms: 50
publish:
- stream: logs
fields:
- _ppf_id
- _ppf_timestamp
- log
- subsystem
- log_format
- _ppf_partition_id
- _ppf_partition_offset
- stream: _ppf_metrics
fields:
- _ppf_latency
# Lumberjack output to central-back
- type: lumberjack_output
component: lumberjack_output_logs
settings:
destination:
- port: 2002
host: centralfront1
compression: true
ssl: true
ssl_client_private_key: "@{PUNCHPLATFORM_SECRETS_DIR}/server.pem"
ssl_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/server.crt"
ssl_trusted_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/fullchain-central.crt"
- port: 2002
host: centralfront2
compression: true
ssl: true
ssl_client_private_key: "@{PUNCHPLATFORM_SECRETS_DIR}/server.pem"
ssl_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/server.crt"
ssl_trusted_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/fullchain-central.crt"
subscribe:
- stream: logs
component: kafka_input
- stream: _ppf_metrics
component: kafka_input
# Topology metrics
metrics:
reporters:
- type: kafka
reporting_interval: 60
# Topology settings
settings:
topology.max.spout.pending: 2000
topology.enable.message.timeouts: true
topology.message.timeout.secs: 30
Collector site configuration example for monitoring channel¶
Only 2 monitoring tasks are needed locally on a remote collection site:
- a local platform monitoring service (computing a synthesis of the local platform framework health). See the configuration example below.
- a forwarding task for all locally collected platform events. See Platform logs management/monitoring applications and queuing overview.
Here is an example of the monitoring channel:
# This channel handles ltr/collection site monitoring, composed of:
# - a platform health service for the site, which writes its result to a local kafka topic (as with all local shiva logs, operator events are also collected into this topic through the 'reporters' configuration)
# - forwarding of logs from this local kafka topic to a central/back-office site that will store them for remote monitoring, history and dashboarding
# This is a reference configuration item developed for DAVE 6.1 release - 31/08/2020 by CVF
version: "6.0"
start_by_tenant: true
stop_by_tenant: true
applications:
# This is a monitoring service, so we run it in the processing_shiva, because the other shiva (the front cluster)
# would not have access to all instances of the services (network isolation/firewalling is the main reason to have a separate front tier).
# platform-monitoring is a builtin punchplatform micro-service, automatically available on shiva runner nodes
- name: platform_health
runtime: shiva
cluster: shiva_collector
command: platform-monitoring
args:
- platform_health.yaml
- --childopts
- -Xms48m -Xmx96m
# This is a punchline in charge of reading events and logs from the local monitoring kafka topic (fed by the reporters)
# and of forwarding them to the central site, where they are stored in the appropriate elasticsearch indices.
# Note that the associated kafka topic is listed in the 'resources' section.
- name: events_forwarder
runtime: shiva
cluster: shiva_collector
command: punchlinectl
args:
- start
- --punchline
- events_forwarder.yaml
- --childopts
- -Xms96m -Xmx96m
# Here we describe the kafka topic resources that are needed for the channel applications to work properly.
# The required kafka topic will be created automatically
# by 'channelctl' at channel start, if it does not already exist.
# Only 1 partition is required (i.e. we do not expect performance or scalability constraints for the monitoring topic,
# because kafka is normally able to handle the events and logs of a platform with a single partition).
# The replication factor is set to 2 (original + 1 copy) so that we avoid losing important events in case of a kafka incident
# (e.g. we want to keep the audit trail events of operator start/stop actions).
resources:
- type: kafka_topic
name: platform-monitoring
partitions: 1
replication_factor: 2
Collector site reference configuration example for platform health monitoring service¶
monitoring_interval: 30
services:
- kafka
- shiva
- zookeeper
reporters:
- type: kafka
topic: platform-monitoring
reporting_interval: 60
encoding: json