Click on this link to go to Central Log Management reference architecture overview
Key design highlights¶
Scalability relies on multiple executors of the Punchlet node inside the punchline (
It is possible to scale on more than 1 JVM or more than 1 server.
This is made possible by spawning other instances of the same application.
The Kafka input topic must have enough
partitions so that each JVM consumes at least 1 partition.
Modularity through multiple processing punchlets¶
For reuse of punchlets, and to cope with the cases where a same data type is wrapped in different protocol envelopes (e.g. syslog-wrapped vs raw event), it is a good habbit to have successive punchlets, each dealing with one protocol layer, or increasing detail of processing.
E.g. processing an "apache daemon event" that reports inside a local log file, that is collected and forwarded by a rsyslog daemon, then sent to the punch 'syslog input node', then we would have need for the following succesive punchlets :
input.punch: responsible for retrieving metadata from the events records (i.e. information captured by the Punch input node, such as the
_ppf_id... See reserved fields. The data remaining to be parsed is the syslog frame.
parsing_syslog_header.punch: responsible for parsing just the syslog frame (event source device hostname, associated timestamp...). The data remaining to be parsed is the raw log line from the original log file.
apache_httpd/parsing.punch: responsible for parsing the apache raw event, retrieving individual fields from the log line.
...other punchlets for additional enrichment / normalization depending on the target data model (ECS, XDAS...)
Because there is a
punchlet_node in the punchline (for the parsing/normalization/enrichment), there is a risk of exception/error in the punchlet code.
This means that some specific processing output queue (in our example, the
reftenant-errors topic) has to be identified to retrieve these unexpected documents, so as to not lose them (so that someone can later identify the problem, and requires changes in source device configuration, or device type discovery mechanisms).
This is very important, as even some unexpected slight change in input log format may cause a malfunction of a parsing expression developed on the standard format hypothesis.
Performance / CPU consumption / Scalability¶
Processing performance is limited by the throughput of the least efficient processing punchlets.
If too many processor/threads (executors) have to be allocated on a
this may imply that one of the punchlet is not optimized and has to be refactored and unit-tested for CPU/processing-time efficiency.
Scalability can be achieved at small scale (up to several thousands EPS) by increasing the
executors count of the
The punchline internal mechanisms will load-balance the processing of tuples between these multiple instances,
up to the maximum number of threads allowed by the JVM.
If more threads/CPUs than a single JVM can have (because of VMs/servers sizing) are required, then the scalability must go one step further : having multiple instances of the same punchline.
To have more than 1 instance of a processing punchline means that the multiple instances will consume the same Kafka input queue. These instances must work concurrently and share the input flows. This is done by assigning a partition of the kafka input queue to each one. At least as many partitions as processing instances must exist in the source kafka topic.
Increasing the partition number can be done without data loss in the kafka topic. However, it may require a restart of the producer punchline that writes data in the kafka topic. Kafka will rebalance produced data on the additional partitions. Please refer to HOWTO alter existing kafka topics.
Apache processor Punchline example¶
version: "6.0" tenant: reftenant channel: processor runtime: shiva type: punchline name: apache dag: # Kafka input - type: kafka_input component: kafka_front_input settings: topic: reftenant-receiver-apache start_offset_strategy: last_committed encoding: lumberjack publish: - stream: logs fields: - log - raw_log - device_type - subsystem - es_index - _ppf_timestamp - _ppf_id - stream: _ppf_metrics fields: - _ppf_latency # Punchlet node - type: punchlet_node settings: punchlet_json_resources: - punch-webserver-parsers-1.0.0/com/thalesgroup/punchplatform/webserver/apache_httpd/resources/http_codes.json - punch-webserver-parsers-1.0.0/com/thalesgroup/punchplatform/webserver/apache_httpd/resources/taxonomy.json punchlet: - punch-common-punchlets-1.0.0/com/thalesgroup/punchplatform/common/input.punch - punch-common-punchlets-1.0.0/com/thalesgroup/punchplatform/common/parsing_syslog_header.punch - punch-webserver-parsers-1.0.0/com/thalesgroup/punchplatform/webserver/apache_httpd/parser_apache_httpd.punch - punch-webserver-parsers-1.0.0/com/thalesgroup/punchplatform/webserver/apache_httpd/enrichment.punch - punch-webserver-parsers-1.0.0/com/thalesgroup/punchplatform/webserver/apache_httpd/normalization.punch subscribe: - stream: logs component: kafka_front_input - stream: _ppf_metrics component: kafka_front_input publish: - stream: logs fields: - log - raw_log - device_type - subsystem - es_index - _ppf_timestamp - _ppf_id - stream: _ppf_metrics fields: - _ppf_latency - stream: _ppf_errors fields: - _ppf_timestamp - _ppf_id - _ppf_platform - _ppf_tenant - _ppf_channel - _ppf_topology - _ppf_component - _ppf_error_message - _ppf_error_document - subsystem - device_type - raw_log # Kafka output - type: kafka_output component: kafka_back_output_logs settings: topic: processor-logs-output encoding: lumberjack producer.acks: "1" subscribe: - stream: logs component: punchlet_node - stream: _ppf_metrics component: punchlet_node - type: kafka_output component: kafka_back_output_errors settings: topic: processor-errors-output encoding: lumberjack producer.acks: "1" subscribe: - stream: _ppf_errors component: punchlet_node metrics: reporters: - type: kafka reporting_interval: 60 settings: topology.max.spout.pending: 2000 topology.component.resources.onheap.memory.mb: 120 topology.enable.message.timeouts: true topology.message.timeout.secs: 30