Log Collector (LTR)¶
Abstract
This chapter explains the deployment of a resilient Punchplatform Log Collector. In older documentation, it was called a Log Transport Receiver (LTR).
One Node Setup¶
- The collector is deployed on a single physical server.
- If log buffering is required, a single-node Kafka broker is deployed.
- A single-node Shiva cluster is in charge of starting the punchlines.
- Punchlines are in charge of:
- receiving the traffic from the collected zone and saving it into Kafka.
- consuming Kafka and forwarding the traffic to the Punch Log Central (see the minimal sketch after this list).
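For illustration, here is a minimal sketch of a channel structure for such a one-node setup. All names (tenant, cluster, punchline files) are illustrative assumptions; the full reference examples later in this chapter remain the authoritative starting point. Note that with a single Kafka broker, topics cannot be replicated:

# Minimal one-node collector channel (sketch, illustrative names)
version: "6.0"
start_by_tenant: true
stop_by_tenant: true
applications:
  # single input punchline: receives logs and writes them to Kafka
  - name: input
    runtime: shiva
    cluster: shiva_collector
    command: punchlinectl
    args:
      - start
      - --punchline
      - input.yaml
  # single forwarding punchline: reads Kafka and ships logs to the Log Central
  - name: forwarder
    runtime: shiva
    cluster: shiva_collector
    command: punchlinectl
    args:
      - start
      - --punchline
      - forwarder.yaml
resources:
  - type: kafka_topic
    name: mytenant-collector-input
    partitions: 1
    replication_factor: 1 # single broker: no replication possible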
Three Nodes Setup (Highly Available)¶
- The collector must be deployed on 3 underlying physical servers.
- Kafka, Zookeeper and Shiva clusters are deployed on 3 VMs or containers.
- Punchlines are in charge of:
- receiving the traffic from the collected zone and saving it into Kafka.
- consuming Kafka and forwarding the traffic to the Log Central.
The multi-node setup makes it possible to:
- listen for incoming logs on multiple servers at the same time, providing high availability of the input point through the deployment of a classical system-level 'Virtual IP' management service clustered on these servers. One listening punchline is located on each of the input servers of the collector site.
- have replicated retention data, with copies of each data fragment on 2 of the 3 servers. Replication is handled natively by Kafka (see the topic declaration sketch after this list).
- provide availability of internal processing (retention, forwarding) through
- inbuilt Kafka resilience to single-node failure
- capability of the shiva cluster to restart tasks (e.g. forwarding punchline) on other nodes in case of node failure
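As an illustration of the replicated retention point above, a channel's 'resources' section declares the replication factor of its Kafka topics. A sketch for a 3-node collector (topic name is illustrative):

resources:
  - type: kafka_topic
    name: mytenant-collector-input
    partitions: 1
    replication_factor: 2 # each partition exists on 2 of the 3 brokers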
The 3-node Zookeeper cluster ensures data integrity and availability even in case of network partitioning (one node being separated from the others): the cluster can rely on a strict majority to agree on the data status.
This makes self-restoration of the collection service and data retention more reliable when a faulty node is repaired and rejoins the two surviving nodes. A physical 2-server setup is likely to cause merge conflicts, and would therefore imply more manual operations to handle some incidents, and more potential data loss.
Except the "input" punchline, that is running once for each input server, the other processing and forwarding punchlines can run in only 1 instance. They are scheduled automatically to a random node by Shiva Cluster, and respawned elsewhere in case of node failure.
Warning
A legacy architecture based on only two nodes is not advised: as for any distributed system, two nodes cannot ensure data integrity in case of network partitioning. This makes data replication impractical (it is not possible to have a resilient Kafka cluster with data replication).
Having two independent nodes that only share a 'Virtual IP' cluster is therefore a rough way to provide high availability of service, and may require much more human effort in incident management, or imply data loss in case of hardware failure while retention is in effect (i.e. not all data was transmitted to the central site).
Key design highlights¶
Metadata gathering at entry point¶
Logs received from the network are enriched with associated metadata at the entry node, in the collector punchline.
This is useful to track where the log network frame came from (_ppf_remote_host and _ppf_remote_port), on which listening port it was received (_ppf_local_port), and what the reception timestamp was (_ppf_timestamp).
By assigning a unique internal id (_ppf_id) to each log at the entry point, we can later reuse this id for deduplication purposes in case of a 'replay' of part of the flow, after an incident or communication instability.
See the syslog input node in the Collector site input punchline example below for a reference configuration of the metadata published with the log flow.
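For illustration only, a record published by the input node could carry metadata similar to the following. All values are made up, and the exact field encoding may differ across punch releases:

{
  "log": "<134>Jan 12 22:14:15 host1 httpd: GET /index.html",
  "_ppf_local_host": "10.0.0.11",
  "_ppf_local_port": 9999,
  "_ppf_remote_host": "203.0.113.7",
  "_ppf_remote_port": 51423,
  "_ppf_timestamp": "2021-01-12T22:14:15.123+00:00",
  "_ppf_id": "c58ffdfb-0a42-4b9c-9ad9-8a78f29b1c3d"
}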
Virtual IP addresses for High Availability of logs reception¶
To ensure High Availability of the logs input listener ports, there are 2 reference patterns:
- Smart log senders can be configured with 2 target IP addresses, so no VIP is needed: the sender switches to the other receiver node if the TCP connection cannot be established.
- Many log senders can only be configured with a single target IP address, though. Thus, the listening input punchline on a remote collection site runs on multiple input servers, and a unique IP address is chosen for this cluster of servers as the log sender target. This 'Virtual IP' is used by only one server at a time, through a cluster of daemons (pacemaker/corosync, keepalived...) that communicate with each other and ensure that the IP is published exactly once at any time.
Important
The Virtual IP cluster of daemons must be configured to place the Virtual IP only on a server where the listening port is indeed active. This allows coping with software failure or maintenance: if the input punchline is stopped on one server, the Virtual IP is expected to move automatically to the other (normally working) input punchline server, even though the first server itself is not down. A minimal keepalived sketch is shown at the end of this section.
In both cases, a listening punchline instance must be located on each of the input servers. To achieve this, fixed placement of each punchline instance on a specific host is obtained using the shiva 'tags' constraint. See the Collector site channel structure example below.
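For the second pattern, a minimal keepalived sketch could look as follows. This is an illustration only, not punch-provided configuration: the interface, addresses, port and check command are assumptions to adapt to your platform. The track_script withdraws the Virtual IP from a node as soon as the local listening port stops answering:

# /etc/keepalived/keepalived.conf (sketch)
vrrp_script chk_input_listener {
    script "/usr/bin/nc -z -w 2 127.0.0.1 9999"   # succeeds only while the input punchline listens
    interval 5
    fall 2
    rise 2
}

vrrp_instance collector_vip {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.0.2.10/24
    }
    track_script {
        chk_input_listener
    }
}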
Remote platform monitoring¶
Collection sites collect events and metrics, and forward them to the central site for centralized monitoring of the collection sites. Please refer to the Platform logs management/monitoring applications and queuing overview reference architecture documentation, and to the reference configuration example of the monitoring channel below.
Reference configuration example¶
Collector site collection/forwarding channel structure example¶
This is an example of a collection site with 2 nodes, hence 2 instances of each input punchline. Only 1 instance of the forwarding punchline is needed for High Availability, thanks to the clustered nature of the Shiva scheduler: it ensures that the forwarding topology is restarted on another node of the cluster in case of server failure.
In our example, a single channel is handling multiple types of incoming events (see punchline example below).
# This channel receives logs on multiple ports of the collector.
# It stores them to a retention Kafka topic and enriches logs with equipment
# information, before forwarding them to a Central site for
# processing/storage/indexing.
version: "6.0"
start_by_tenant: true
stop_by_tenant: true
applications:
# input on collector1
- name: input1
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- collector1
command: punchlinectl
args:
- start
- --punchline
- input.yaml
- --childopts
- -Xms96m -Xmx96m # will override topology.component.resources.onheap.memory.mb
# input on collector2
- name: input2
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- collector2
command: punchlinectl
args:
- start
- --punchline
- input.yaml
- --childopts
- -Xms96m -Xmx96m
# parser to add equipment info
- name: parser
runtime: shiva
cluster: shiva_collector
shiva_runner_tags:
- shiva_collector
command: punchlinectl
args:
- start
- --punchline
- parser.yaml
- --childopts
- -Xms96m -Xmx96m
# forwarder to central site
- name: forwarder
runtime: shiva
cluster: shiva_collector
command: punchlinectl
shiva_runner_tags:
- shiva_collector
args:
- start
- --punchline
- forwarder.yaml
- --childopts
- -Xms96m -Xmx96m
# store to local filesystem
- name: to_flat_storage
runtime: shiva
cluster: shiva_collector
command: punchlinectl
args:
- start
- --punchline
- to_flat_storage.yaml
- --childopts
- -Xms96m -Xmx96m
# Kafka topics required for this channel
resources:
- type: kafka_topic
name: reftenant-collector-input
partitions: 1
replication_factor: 2
- type: kafka_topic
name: reftenant-parser-output
partitions: 1
replication_factor: 2
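Assuming this channel is named 'collector' in tenant 'reftenant', it would typically be started with channelctl. This is a command sketch; check the channelctl documentation of your punch release for the exact options:

channelctl --tenant reftenant start --channel collector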
Collector site input punchline example¶
This is an example of a receiver punchline listening on 3 different TCP ports for incoming logs. Here the hypothesis is that:
- port 9999 will receive syslog logs from apache_httpd devices.
- port 8888 will receive syslog logs from winlogbeat devices.
- port 7777 will receive syslog logs from linux devices.
The idea here is to multiplex all these logs in a single kafka retention queue. Later, the log central site will dispatch the logs to the appropriate processing topology based on the port number that received the log.
# Listens to 3 different subsystems and stores raw logs into Kafka.
# It will be launched on multiple nodes for resiliency and performance.
version: "6.0"
type: punchline
name: input
tenant: reftenant
channel: multi_subsystems
runtime: shiva
dag:
# Syslog input subsystem1 (apache)
- type: syslog_input
component: subsystem1_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 9999
self_monitoring.activation: true
self_monitoring.period: 60
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Syslog input subsystem2 (winlogbeat)
- type: syslog_input
component: subsystem2_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 8888
self_monitoring.activation: true
self_monitoring.period: 10
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Syslog input subsystem3 (linux)
- type: syslog_input
component: subsystem3_syslog_input
settings:
listen:
proto: tcp
host: collectorX # will be resolved
port: 7777
self_monitoring.activation: true
self_monitoring.period: 10
publish:
- stream: logs
fields:
- log
- _ppf_local_host
- _ppf_local_port
- _ppf_remote_host
- _ppf_remote_port
- _ppf_timestamp
- _ppf_id
- stream: _ppf_metrics
fields:
- _ppf_latency
# Kafka output
- type: kafka_output
settings:
topic: reftenant-collector-input
encoding: lumberjack
producer.acks: "1"
subscribe:
- stream: logs
component: subsystem1_syslog_input
- stream: _ppf_metrics
component: subsystem1_syslog_input
- stream: logs
component: subsystem2_syslog_input
- stream: _ppf_metrics
component: subsystem2_syslog_input
- stream: logs
component: subsystem3_syslog_input
- stream: _ppf_metrics
component: subsystem3_syslog_input
metrics:
reporters:
- type: kafka
reporting_interval: 60
settings:
topology.max.spout.pending: 2000
topology.enable.message.timeouts: true
topology.message.timeout.secs: 30
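A note on the design choice above: producer.acks set to "1" acknowledges a write as soon as the Kafka partition leader has stored it, which favors throughput. If losing the last few logs on a broker failure is unacceptable, the standard Kafka producer setting "all" waits for the in-sync replicas instead, at the cost of latency. A sketch of the variant:

  # Variant: wait for all in-sync replicas before acknowledging
  # (higher durability, higher latency; requires a replicated topic)
  - type: kafka_output
    settings:
      topic: reftenant-collector-input
      encoding: lumberjack
      producer.acks: "all"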
Collector site collected events forwarder punchline example¶
This is an example of a forwarding punchline towards a single central site, with two reception servers on the central site for high availability.
# Forwards parsed logs to central site.
version: "6.0"
tenant: reftenant
channel: collector
runtime: shiva
type: punchline
name: forwarder
dag:
# Kafka input
- type: kafka_input
component: kafka_input
settings:
topic: reftenant-parser-output
encoding: lumberjack
start_offset_strategy: last_committed
self_monitoring.activation: true
self_monitoring.period: 60
publish:
- stream: logs
fields:
- _ppf_id
- _ppf_timestamp
- log
- subsystem
- log_format
- _ppf_partition_id
- _ppf_partition_offset
- stream: _ppf_metrics
fields:
- _ppf_latency
# Lumberjack output to central-back
- type: lumberjack_output
component: lumberjack_output_logs
settings:
destination:
- port: 2002
host: centralfront1
compression: true
ssl: true
ssl_client_private_key: "@{PUNCHPLATFORM_SECRETS_DIR}/server.pem"
ssl_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/server.crt"
ssl_trusted_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/fullchain-central.crt"
- port: 2002
host: centralfront2
compression: true
ssl: true
ssl_client_private_key: "@{PUNCHPLATFORM_SECRETS_DIR}/server.pem"
ssl_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/server.crt"
ssl_trusted_certificate: "@{PUNCHPLATFORM_SECRETS_DIR}/fullchain-central.crt"
subscribe:
- stream: logs
component: kafka_input
- stream: _ppf_metrics
component: kafka_input
# Topology metrics
metrics:
reporters:
- type: kafka
reporting_interval: 60
# Topology settings
settings:
topology.max.spout.pending: 2000
topology.enable.message.timeouts: true
topology.message.timeout.secs: 30
Collector site configuration example for monitoring channel¶
Only 2 monitoring tasks are needed locally on a remote collection site:
- a local platform monitoring service (computing the synthesis of the local platform framework health). See the configuration example below.
- a forwarding task for all locally collected platform events. See Platform logs management/monitoring applications and queuing overview.
Here is an example of the monitoring channel:
# This channel handles ltr/collection site monitoring. It is composed of:
# - a platform health service for the site, writing its result to a local
#   kafka topic (as with all local shiva logs, operator events will also be
#   collected through the 'reporters' configuration)
# - forwarding of logs from this local kafka topic to a central/back-office
#   site that will store them for remote monitoring, history and dashboarding
version: "6.0"
start_by_tenant: true
stop_by_tenant: true
applications:
# This is a monitoring service, so we run it in the processing_shiva, because the other shiva (the front cluster)
# would not have access to all instances of the services (network isolation/firewalling is the main reason to have a separate front tier).
# platform-monitoring is a builtin punchplatform micro-service, automatically available on shiva runner nodes
- name: platform_health
runtime: shiva
cluster: shiva_collector
command: platform-monitoring
args:
- platform_health.yaml
- --childopts
- -Xms48m -Xmx96m
# This is a punchline in charge of reading events and logs from the central reporter kafka topic
# and of sending them to appropriate various elasticsearch indices.
# Note that associated kafka topic is listed in the 'resources' section.
- name: events_forwarder
runtime: shiva
cluster: shiva_collector
command: punchlinectl
args:
- start
- --punchline
- events_forwarder.yaml
- --childopts
- -Xms96m -Xmx96m
# Here we describe kafka topic resource, that are needed for proper working of the channels application.
# Here, the required kafka topic will be created automatically
# by 'channelctl' at start of channel, if it does not already exist
# Here, only 1 partition is required (i.e. we do not expect a performance or scalability constraints for the monitoring topic,
# because kafka is normally able to handle the events and logs of a platform with a single partition.)
# Replication factor is set to 2 (original + 1 copy) so that we avoid losing important events in case of kafka incident
# (e.g. we want to keep the audit trail events of operator start/stop actions).
resources:
- type: kafka_topic
name: platform-monitoring
partitions: 1
replication_factor: 2
Collector site reference configuration example for platform health monitoring service¶
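This is the content of the platform_health.yaml file referenced by the platform_health application of the monitoring channel above. It computes the health synthesis of the local kafka, shiva and zookeeper clusters every 30 seconds, and reports it to the local platform-monitoring Kafka topic: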
monitoring_interval: 30
services:
  - kafka
  - shiva
  - zookeeper
reporters:
  - type: kafka
    topic: platform-monitoring
    reporting_interval: 60
    encoding: json