Log Collector/Forwarder

Abstract

This chapter explains the recommended deployment of a resilient punch log collector.

This configuration is referred to as an LTR.

One Node Setup

  • The collector is deployed on a single physical server.
  • If log buffering is required, a single-node Kafka broker is also deployed.
  • A single-node Shiva cluster is in charge of starting the punchlines
  • Punchlines are in charge of
    • receiving the traffic from the collected zone and saving it into Kafka
    • consuming Kafka and forwarding the traffic to the central punch site

Three Nodes Setup

This setup is highly available.

  • The collector must be deployed on three underlying physical servers (not merely 3 logical VMs on fewer physical servers).
  • The Kafka, Zookeeper and Shiva clusters are deployed on three VMs or containers.
  • Punchlines are in charge of
    • receiving the traffic from the collected zone and saving it into Kafka
    • consuming Kafka and forwarding the traffic to the central punch site

The multi-node setup makes it possible to:

  • listen for incoming logs on multiple servers at the same time, allowing for high availability of the input point through the deployment of a classical system-level 'Virtual IP' management service clustered on these servers. One listening punchline is located on each of the input servers of the collector site.
  • have replicated retention data, with copies of each data fragment on 2 of the 3 servers (replication is handled natively by Kafka, see the sketch after this list)
  • provide availability of internal processing (retention, forwarding) through
    • inbuilt Kafka resilience to single-node failure
    • capability of the shiva cluster to restart tasks (e.g. forwarding punchline) on other nodes in case of node failure
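As an illustration of the data replication point above, here is how a retention topic with a replication factor of 2 could be created with the standard Kafka tooling. Topic name, partition count and broker address are examples only; on a punch platform the retention topics are normally created by the platform tooling itself:

```sh
# each partition gets 2 replicas spread over the 3 brokers,
# so the loss of any single broker does not lose retained data
kafka-topics.sh --create \
  --bootstrap-server kafka1:9092 \
  --topic reftenant-ltr-multitech \
  --partitions 6 \
  --replication-factor 2 \
  --config min.insync.replicas=1
```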

The three-node Zookeeper cluster ensures data integrity and availability even in case of network partitioning (one node being separated from the others).

That way, the cluster can rely on having a strict majority of the nodes (2 out of 3) in order to know it possesses 'the truth' about the data status.

This makes self-restoration of the collection service and data retention more reliable when a faulty node is repaired and rejoins the two surviving nodes. A two-physical-server setup cannot form such a majority and is likely to cause merge conflicts, therefore implying more manual operations to handle some incidents, and more potential data loss.

Except for the "input" punchline, which runs once on each input server, the other processing/forwarding punchlines of the collector site can run as a single instance, automatically scheduled to any node by the Shiva cluster and respawned elsewhere in case of node failure.

Warning

A legacy architecture based on only two nodes is not advised: as for any distributed system, two nodes are not able to ensure data integrity in case of network partitioning. This makes data replication impractical (it is not possible to have a resilient Kafka cluster with data replication).

Having two independent nodes that only share a 'virtual IP' cluster is therefore a rough way to provide high availability of service, but it may require much more human effort in incident management, or imply data loss in case of hardware failure while retention is in effect (i.e. not all data was transmitted to the central site).

Design Drivers

Metadata Gathering at Entry Point

Logs received from the network are enriched with associated metadata at the entry node of the collector punchline. This is useful to track where the log network frame came from, what the reception port was, and what the exact reception timestamp is.

By assigning a unique internal id to each log at the entry point, we can later reuse this id for deduplication purposes in case of a 'replay' of part of the flow due to an incident or communication instability.

See the syslog input node in the Collector site input punchline example below for a reference configuration of the metadata published with the log flow.
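For instance, the publish section of the input node can list the raw log together with the reception metadata fields. A minimal sketch is shown here; the `_ppf_*` field names follow the punch reserved-field convention, and the exact list is an assumption to be checked against the reference example below:

```hjson
publish: [
  {
    stream: logs
    # raw log plus reception metadata: unique internal id, reception timestamp,
    # local listening port and remote sender address
    fields: [ "log", "_ppf_id", "_ppf_timestamp", "_ppf_local_port", "_ppf_remote_host" ]
  }
]
```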

Virtual IP addresses for High Availability of logs reception

To ensure high availability of the log input listener ports, there are two reference patterns:

  • Smart log senders can be configured with two target IP addresses, in which case no VIP is needed. The sender will switch to the other receiver node if the TCP connection cannot be established.

  • Many log senders, though, can be configured with only one target IP address. In that case, the listening input punchline of a remote collection site runs on multiple input servers, and a single IP address is chosen for this cluster of servers as the log sender target. This 'virtual' IP is used by only one server at a time, through a cluster of daemons (e.g. pacemaker/corosync, keepalived...) that communicate with each other and ensure that the IP is published exactly once at any time.

Important

The Virtual IP cluster of daemons must be configured to place the Virtual IP only on a server where the listening port is actually active. This makes it possible to cope with software failure or maintenance: if the input punchline is stopped on one server, the Virtual IP is expected to be moved automatically to the other (normally working) input punchline server, even though the first server itself is not down.
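As an illustration, here is a minimal keepalived sketch of this behaviour (interface name, addresses and port are hypothetical): the VIP is only claimed by a node whose track script confirms that the local input punchline is actually listening.

```
# /etc/keepalived/keepalived.conf (illustrative sketch)
vrrp_script check_input_punchline {
    # succeeds only if the local input punchline listens on its syslog port
    script "/usr/bin/nc -z 127.0.0.1 1522"
    interval 5
    fall 2
    rise 2
}

vrrp_instance ltr_input_vip {
    interface eth0
    state BACKUP
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.0.0.100/24
    }
    track_script {
        check_input_punchline
    }
}
```

An equivalent behaviour can be obtained with pacemaker/corosync by colocating the Virtual IP resource with a resource that monitors the punchline listener.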

In both cases, a listening punchline instance must be located on each of the input servers. To achieve this, a fixed placement of each punchline instance on a specific host is obtained by using the shiva 'tags' constraint. See the Collector site channel structure example below.

Remote platform monitoring

Collection sites collect events and metrics, and forward them to the central site for central monitoring of collection sites. Please refer to the Platform logs management/monitoring applications and queuing overview reference architecture documentation, and to the reference configuration example of the monitoring channel below.

Reference configuration example

Collector site collection/forwarding channel structure example

This is an example of a collection site with 3 input nodes, so 3 instances of the input punchline, each located on a fixed input server. Only one instance of the forwarding punchline is needed for high availability, thanks to the clustered nature of the shiva scheduler, which ensures the forwarding punchline is restarted on another node of the cluster in case of server failure.

In our example, a single channel is handling multiple types of incoming events (see punchline example below).

tenants/reftenant/channels/ltr_multitech/channel_structure.hjson
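The content of the referenced file is not reproduced in this extract. As an illustration only, a minimal channel structure sketch is given below; application names, shiva cluster name and runner tags are assumptions, and the exact fields may differ between punch releases:

```hjson
{
  version: "6.0"
  start_by_tenant: true
  stop_by_tenant: true
  applications: [
    {
      # one input punchline instance pinned to a given input server via a shiva tag
      name: ltr_in_server1
      runtime: shiva
      cluster: common
      shiva_runner_tags: [ "input_server1" ]
      command: punchlinectl
      args: [ "start", "--punchline", "ltr_in.hjson" ]
    }
    # ... ltr_in_server2 and ltr_in_server3 are identical, except for their tag ...
    {
      # single forwarding punchline, free to run on any shiva node,
      # respawned elsewhere by shiva in case of node failure
      name: ltr_out
      runtime: shiva
      cluster: common
      command: punchlinectl
      args: [ "start", "--punchline", "ltr_out.hjson" ]
    }
  ]
}
```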

Collector site input punchline example

This is an example of a receiver punchline listening on 3 different TCP ports for incoming logs. Here the hypothesis is that:

  • port 1522 will receive syslog logs of apache_httpd type
  • port 1523 will receive syslog logs from sourcefire devices
  • port 1524 will receive syslog logs from other kinds of devices

The idea here is to multiplex all these logs into a single Kafka retention queue, with the capability to later dispatch the logs to the appropriate central site log processing channel, based on the port number on which each log was received.

tenants/reftenant/channels/ltr_multitech/ltr_in.hjson
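The content of the referenced file is not reproduced in this extract. A minimal sketch of such an input punchline is given below; node type names, reserved field names, stream and topic names are assumptions to be checked against your punch release:

```hjson
{
  dag: [
    {
      type: syslog_input
      component: syslog_input_1522
      settings: {
        listen: {
          proto: tcp
          host: 0.0.0.0
          port: 1522
        }
      }
      publish: [
        {
          stream: logs
          # raw log plus reception metadata (unique id, timestamp, port, sender)
          fields: [ "log", "_ppf_id", "_ppf_timestamp", "_ppf_local_port", "_ppf_remote_host" ]
        }
      ]
    }
    # ... two similar syslog_input nodes listening on ports 1523 and 1524 ...
    {
      # all input streams are multiplexed into a single retention topic
      type: kafka_output
      component: kafka_output
      settings: {
        topic: ltr_multitech
        encoding: lumberjack
      }
      subscribe: [
        {
          component: syslog_input_1522
          stream: logs
        }
        # ... plus the streams published by the 1523 and 1524 input nodes ...
      ]
    }
  ]
}
```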

Collector site collected events forwarder punchline example

This is an example of a forwarder punchline for collected events, targeting a single central site with two reception servers on the central site for high availability.

tenants/reftenant/channels/ltr_multitech/ltr_out.hjson
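The content of the referenced file is not reproduced in this extract. A minimal sketch of such a forwarder is given below; node type names, settings and hostnames are assumptions to be checked against your punch release:

```hjson
{
  dag: [
    {
      type: kafka_input
      component: kafka_input
      settings: {
        topic: ltr_multitech
        start_offset_strategy: last_committed
      }
      publish: [
        {
          stream: logs
          fields: [ "log", "_ppf_id", "_ppf_timestamp", "_ppf_local_port", "_ppf_remote_host" ]
        }
      ]
    }
    {
      type: lumberjack_output
      component: lumberjack_output
      settings: {
        # two central-site receivers: the output switches to the second host
        # if the first one becomes unreachable
        destination: [
          { "host": "central-input-1.example.org", "port": 1600 }
          { "host": "central-input-2.example.org", "port": 1600 }
        ]
      }
      subscribe: [
        {
          component: kafka_input
          stream: logs
        }
      ]
    }
  ]
}
```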

Collector site configuration example for monitoring channel

Only two monitoring tasks are locally needed on a remote collection site:

  • a local platform monitoring service (computing the synthesis of the local platform framework health), see the configuration example below
  • a forwarding task for all locally collected platform events (see the Platform logs management/monitoring applications and queuing overview reference architecture documentation)

Here is an example of the monitoring channel:

tenants/platform/channels/monitoring/channel_structure.hjson
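The content of the referenced file is not reproduced in this extract. As an illustration only, a sketch of such a monitoring channel structure is given below; the application and command names, as well as the forwarding punchline file name, are hypothetical placeholders rather than exact punch commands:

```hjson
{
  version: "6.0"
  start_by_tenant: true
  stop_by_tenant: true
  applications: [
    {
      # local platform health synthesis service (command name is illustrative)
      name: platform_health
      runtime: shiva
      cluster: common
      command: platform_monitoring
      args: [ "platform_health.json" ]
    }
    {
      # forwards locally collected platform events and metrics to the central site
      # (punchline file name is a placeholder)
      name: platform_events_forwarder
      runtime: shiva
      cluster: common
      command: punchlinectl
      args: [ "start", "--punchline", "forward_platform_events.hjson" ]
    }
  ]
}
```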

Collector site reference configuration example for platform health monitoring service

tenants/platform/channels/monitoring/platform_health.json