Log Management Site

Goals and Features

This reference architecture presents classical patterns for a central ("back-office") log management site.

The working assumptions are:

  • Logs are collected on remote collector (LTR) sites, and forwarded to this central site using the lossless (acknowledged) Lumberjack protocol
  • The central site implements the following services:
    • Highly available log reception from remote collection sites
    • Highly available local log reception from central-site source devices
    • Log parsing, normalization and enrichment, with processing adapted to the source device type
    • (optional) Raw log forwarding to an external system, with filtering rules based on parsed log fields
    • (optional) Raw log archiving as compressed CSV files on an S3 storage device, with coarse-grained indexing of the archives (by time scope and device type)
    • (optional) Elasticsearch indexing, enabling logs to be searched, queried and aggregated through the Kibana Web HMI
    • Monitoring of the synthetic health of the central-site platform framework and applications, and publishing of this health through dashboards and a REST API
    • Housekeeping of old archived files and indexed logs (purging data older than a chosen retention period)
    • Central monitoring of the synthetic health of remote collector/LTR platform frameworks and applications, and publishing of this health through dashboards and a REST API
  • The central site must protect data and provide high service availability when a SINGLE failure occurs.

Note that a specific dual-site reference configuration pattern is documented in a separate section.

High-level Design

Two main (logical) application planes exist in the pipelines of a central log management punchplatform:

  • "user logs" pipelines (for the cyber/business-oriented logs coming from source devices)
  • "platform logs / monitoring" applications

User Logs Processing Overview

The following diagram introduces the main pipelines for transporting/processing the logs collected from the various source devices.

!!! tip Use the links below the diagram to jump to the reference architecture highlights/reference configuration items for each component.

Reference central site Pipelines (image)

(1) Receiver punchlines (HA cluster)

(2) 'Front' Kafka topics (for fast reception of forwarded or locally collected logs)

(3) Processors punchlines (for parsing, normalization, enrichment)

(4) 'Back' Kafka topic (multiple readers, because there are multiple data outputs)

(5) Indexer punchline (writer to Elasticsearch)

(6) Archiver punchline / meta-data writer (archive indexing)

(7) S3-API-compliant storage device for archives

(8) Syslog/TCP formatter/forwarder punchline

(9) Virtual IP for High Availability of Logs collection

Platform Logs Processing Overview

The following diagram introduces the pipeline components for transporting/processing the platform and application monitoring data (logs, scheduling events, operator actions, operating system metrics, application metrics, health reports for platform components and channels).

!!! tip Use the links below the diagram to jump to the reference architecture highlights/reference configuration items for each component.

Reference collection and central site Pipelines of platform events (image)

(1) Platform Events Kafka topic (json encoded)

(2) Platform events forwarder punchline

(3) Platform Events dispatcher punchlines

(4) Channels monitoring applications (1 per tenant)

(5) Local platform monitoring service

(6) Metricbeat daemon (OS metrics collector)

(7) Local Elasticsearch metrics reporters

(8) Syslog/TCP formatter/forwarder punchline

Design Drivers

Front Kafka Topics

Multiple log types arrive in a multiplexed stream. Because they may need to be processed by very different Processor punchlines, it is useful to 'sort' them into separate 'input' queues. The main advantage is that if one processor is stuck because of a significant anomaly in its input log flow, the OTHER log types will still be processed normally.

These 'input' queues also serve as buffers, allowing fast transfer from the remote collection sites (especially after a communication link has been interrupted for some time and then repaired), while leaving time to process the logs later, even if a maintenance or fixing action is under way on a specific processing channel.

Typically, it is advised to configure 7 days of retention on the input Kafka topics (a long weekend plus the time to solve an incident), to greatly reduce the risk of log loss during production incident management.
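Since this retention directly drives the Kafka disk sizing, a quick back-of-the-envelope computation is useful. The sketch below uses purely illustrative figures (2000 events/s, 800 bytes per event, replication factor 2, all assumptions to adapt to your platform) and shows the matching value of the standard Kafka retention.ms topic setting:

```python
# Rough sizing sketch for a 'front' Kafka topic with 7 days of retention.
# Throughput, event size and replication figures are assumptions, not reference values.
RETENTION_DAYS = 7          # maps to the Kafka topic setting 'retention.ms'
EVENTS_PER_SECOND = 2000
AVG_EVENT_BYTES = 800
REPLICATION_FACTOR = 2

retention_ms = RETENTION_DAYS * 24 * 3600 * 1000
disk_bytes = RETENTION_DAYS * 24 * 3600 * EVENTS_PER_SECOND * AVG_EVENT_BYTES * REPLICATION_FACTOR

print(f"retention.ms = {retention_ms}")                                   # 604800000 for 7 days
print(f"disk needed  ~ {disk_bytes / 1e12:.2f} TB across the Kafka cluster")
```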

Back Kafka Topic

The output of the Processor punchlines has a similar format for all log types (except for error processing): a parsed JSON document, a raw log string and some metadata.

Therefore it is possible to have only one 'output' Kafka queue, and a 'big' indexer, forwarder or archiver punchline common to all log types.

Sometimes it can be useful to tune the Elasticsearch indexing log type by log type, because different log types may have very different volumes, leading to different sizing/sharding of the Elastic indices (output Elastic indices are often split by log type, because the columns/fields differ between log types).

In these cases, we can have multiple output Kafka topics. But this means more punchline components (punchlines/Kafka input nodes) to read these multiple queues, leading to more configuration management and somewhat higher memory usage (due to the JVM 'base' memory consumption of around 200 MB).

These 'output' topics are also useful to store the logs during Elasticsearch incidents or maintenance. It is therefore advised to size these queues for 5 to 7 days of logs.
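The 'multiple readers' pattern on the back topic relies on Kafka consumer groups: each output punchline uses its own group, so each one independently receives the full log stream. Here is a minimal sketch of that mechanism (not punchplatform configuration; the topic name and broker address are assumptions):

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Each output punchline (indexer, archiver, forwarder) reads the same 'back' topic with
# its own consumer group, so each one independently receives the full log stream.
def back_topic_reader(group_id):
    return KafkaConsumer(
        "mytenant-back-output",              # assumed 'back' topic name
        bootstrap_servers=["kafka1:9092"],   # assumed broker address
        group_id=group_id,                   # distinct group => independent committed offsets
        enable_auto_commit=False,            # commit only once the downstream write has succeeded
    )

indexer = back_topic_reader("indexer")
archiver = back_topic_reader("archiver")
forwarder = back_topic_reader("forwarder")
```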

Virtual IP addresses for High Availability of logs reception

To ensure high availability of the log input listener port, there are 2 reference patterns:

  • Smart log senders can be configured with 2 target IP addresses, therefore no VIP is needed. The sender switches to the other receiver node if the TCP connection cannot be established.

  • Many log senders can be configured with only 1 target IP address, though. Thus, the listening input punchline (on a central collection site or on a remote collection site) runs on multiple input servers, and a unique IP address is chosen for this cluster of servers as the log senders' target. This 'virtual' IP is held by only one server at a time, through a cluster of daemons (pacemaker/corosync, keepalived...) that communicate with each other and ensure that the IP is published exactly once at any time.

!!! Important The Virtual IP cluster of daemons must be configured to place the Virtual IP only on a server where the listening port is actually active. This allows coping with software failure or maintenance: if the input punchline is stopped on one server, the Virtual IP is expected to move to the other (normally working) input punchline server.
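For example, keepalived or pacemaker can track an external check script and give up the Virtual IP when the check fails. A minimal sketch of such a check, written in Python and assuming the input punchline listens on local port 514 (adapt host, port and exit codes to your VIP daemon):

```python
#!/usr/bin/env python3
# Exit 0 if the local listener port is open (this node may hold the VIP),
# exit 1 otherwise (the VIP daemon should move the address to another node).
import socket
import sys

LISTENER_HOST = "127.0.0.1"   # assumed address of the input punchline listener
LISTENER_PORT = 514           # assumed listening port

try:
    with socket.create_connection((LISTENER_HOST, LISTENER_PORT), timeout=2):
        sys.exit(0)
except OSError:
    sys.exit(1)
```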

Platform events Kafka topic

On remote collection sites, we usually have no local Elasticsearch/Kibana service (to reduce the required local resource consumption). Yet it is often desired to have a central monitoring capability for all collection sites associated with a central log management site.

So on collection sites, we use a single Kafka queue (named 'platform-events') to collect all the monitoring/audit information we may need, and then forward this information to the central site.

The internal encoding of data in this Kafka topic is JSON (and not lumberjack frames, as is usually the case for most Kafka queues used by pipelines). This is because we want to be able to receive direct output from various collecting services, including 'beats' (such as metricbeat for collecting OS-level resource metrics, auditbeat for capturing system events...). The beats daemons (from the Elastic/Logstash ecosystem) natively support writing their output as JSON to Kafka.
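Because the topic carries plain JSON documents, any local service can publish into it with a standard Kafka client. A minimal sketch (the topic name, broker address and document fields shown here are assumptions):

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Publish a JSON-encoded monitoring event to the 'platform-events' topic, using the same
# plain-JSON encoding that beats daemons use with their Kafka output.
producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092"],
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),  # plain JSON, no lumberjack framing
)
producer.send("platform-events", {
    "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "type": "platform-health",
    "platform": {"id": "ltr-site-1"},
    "health": "green",
})
producer.flush()
```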

On central sites, we could bypass such a 'platform-events' topic by reporting events and metrics directly to Elasticsearch. But we want to keep the same architecture, and the same logic for dispatching events into the various Elasticsearch indices, as for 'remote' events. Writing to Elasticsearch through the Elastic output node of a punchline is also often more efficient than direct metric/event reporting (which sends fewer documents per request). So we use this Kafka queue on central log management sites as well.

Metricbeat

The Metricbeat daemon is a standard lightweight Operating-System metrics collector from the ELK 'beats' ecosystem. We advise deploying it on all VMs of all platforms, and keeping these metrics in central 'platform-metricbeat-' Elasticsearch indices for a few days, to help manage incidents, investigate performance or stability issues, and support capacity supervision (the platform custom supervision system can raise alerts based on these metrics by querying them from Elasticsearch, as sketched below).
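As an illustration of the kind of alerting query a custom supervision system could run against these indices, the sketch below computes the average CPU usage per host over the last 5 minutes; the Elasticsearch URL and the alert threshold are assumptions, and the field names follow the usual metricbeat schema:

```python
import requests

# Average CPU usage per host over the last 5 minutes, from the central metricbeat indices.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-5m"}}},
    "aggs": {
        "per_host": {
            "terms": {"field": "host.name", "size": 200},
            "aggs": {"avg_cpu": {"avg": {"field": "system.cpu.total.pct"}}},
        }
    },
}
resp = requests.post("http://elasticsearch:9200/platform-metricbeat-*/_search",
                     json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_host"]["buckets"]:
    avg_cpu = bucket["avg_cpu"]["value"]
    if avg_cpu is not None and avg_cpu > 0.9:          # alert above 90% average CPU
        print(f"ALERT: high CPU on {bucket['key']} ({avg_cpu:.0%})")
```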

This daemon runs under direct systemd monitoring/control, so no punchplatform channel or application is associated with running this service (see the platform deployment configuration).

Local Elastic metrics reporters

For some central applications, it may be desirable to have metrics or events reported directly to the central monitoring Elasticsearch indices, without going through the 'platform-events' Kafka topic and the associated 'dispatcher' punchline. This is especially the case for the critical 'dispatcher' punchline itself! (This lets us better troubleshoot this punchline in case of an incident on the monitoring events dispatching chain.)

It also reduces the load on the central monitoring events dispatching punchline (e.g. having local metricbeat daemons report directly to Elasticsearch).
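For illustration only, such direct reporting typically goes through the standard Elasticsearch bulk API; the sketch below shows the idea with an assumed index name and metric fields (it is not the punchplatform reporter configuration itself):

```python
import json
from datetime import datetime, timezone

import requests

# Send a small batch of metrics straight to Elasticsearch with the _bulk API,
# bypassing the 'platform-events' topic. Index name and fields are assumptions.
metrics = [
    {"@timestamp": datetime.now(timezone.utc).isoformat(), "name": "uptime.seconds", "value": 86400},
    {"@timestamp": datetime.now(timezone.utc).isoformat(), "name": "tuples.acked", "value": 1520},
]
lines = []
for doc in metrics:
    lines.append(json.dumps({"index": {"_index": "platform-dispatcher-metrics"}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"                      # bulk body is newline-delimited JSON

resp = requests.post("http://elasticsearch:9200/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"}, timeout=10)
resp.raise_for_status()
```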

Platform Events Housekeeping

A housekeeping channel must be configured for the platform tenant, in order to clean up old metrics and events indices.

Here is an example of a housekeeping service configuration:

tenants/platform/channels/housekeeping/elasticsearch-housekeeping.json
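The reference configuration above drives the punchplatform housekeeping service. Purely as an illustration of the underlying principle (with an assumed daily index naming suffix, retention value and Elasticsearch URL), deleting indices older than the retention period amounts to:

```python
import re
from datetime import datetime, timedelta, timezone

import requests

# Delete daily 'platform-*' indices older than the retention period.
# Illustration only: the actual service is configured through the JSON file referenced above.
ES_URL = "http://elasticsearch:9200"
RETENTION = timedelta(days=30)                      # assumed retention period
cutoff = datetime.now(timezone.utc) - RETENTION

indices = requests.get(f"{ES_URL}/_cat/indices/platform-*?h=index&format=json", timeout=10).json()
for entry in indices:
    name = entry["index"]
    match = re.search(r"(\d{4})\.(\d{2})\.(\d{2})$", name)   # daily suffix, e.g. -2024.01.31
    if match:
        index_day = datetime(*map(int, match.groups()), tzinfo=timezone.utc)
        if index_day < cutoff:
            requests.delete(f"{ES_URL}/{name}", timeout=30)
            print(f"deleted {name}")
```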

Platform Monitoring Channels

To compute a synthetic health report for the platform monitoring channels, a channels monitoring service instance must also be configured for the platform tenant. Please refer to the Channels Monitoring documentation for monitoring rules and configuration. Here is a reference configuration example for the platform channels monitoring service:

tenants/platform/channels/monitoring/channels_monitoring.yaml

Platform Monitoring Service

To compute a synthetic platform health report, a platform monitoring service instance must also be configured for the platform tenant on EACH site. Please refer to the Platform Monitoring documentation for configuration and API access. Here is a reference configuration example for the platform monitoring service:

tenants/platform/channels/monitoring/platform_monitoring.yaml