Log Processor Punchline

Refer to the central log management overview

Reference central site Pipelines (image)

Key Design highlights

Scalability

Scalability relies first on running multiple executors of the punch node inside the punchline (the "executors" setting).

If you need to scale beyond one JVM (more than one server), you can run additional instances of the same application, provided the input Kafka topic has enough partitions for each JVM to consume at least one of them.

Modularity/layers through multiple processing punchlets

To make punchlets reusable, and to handle cases where the same data type arrives wrapped in different protocol envelopes (e.g. syslog-wrapped vs. raw event), it is good practice to chain successive punchlets, each dealing with one protocol layer or with an increasing level of processing detail.

For example, consider an "apache daemon event" written to a local log file, collected and forwarded by an rsyslog daemon, then received by the punch 'syslog input node'. This calls for the following successive punchlets:

  • input.punch => responsible for retrieving metadata from the event records (i.e. information captured/generated by the punch input node, such as the remote IP/HOST, the reception timestamp, the local port, a unique log id generated at input time...). The data remaining to be parsed is the syslog frame.

  • parsing_syslog_header.punch => responsible for parsing just the syslog frame (event source device hostname, associated timestamp...). The data remaining to be parsed is the raw log line from the original log file.

  • apache_httpd/parsing.punch => responsible for parsing the apache raw event, retrieving individual fields from the log line.

  • ...other punchlets for additional enrichment / normalization depending on the target data model (ECS, XDAS...)
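The layering above can be illustrated with a minimal sketch. It is written in plain Python rather than in the Punch language, and every name in it (the stage functions, the field names, the sample record, the simplified regular expressions) is a hypothetical stand-in, not the content of the real punchlets. It only shows how each layer consumes its own envelope and leaves the remainder for the next one.

```python
import re

# Simplified patterns for illustration only; real syslog and Apache formats
# have more variants than these expressions handle.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<rest>.*)$"
)
APACHE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)$'
)


def input_stage(record):
    """Layer 1 (stand-in for input.punch): keep receiver metadata, pass the frame on."""
    return {
        "receiver": {"remote_ip": record["remote_ip"], "received_at": record["received_at"]},
        "message": record["data"],  # still the full syslog frame
    }


def parse_syslog_header(doc):
    """Layer 2 (stand-in for parsing_syslog_header.punch): strip the syslog envelope."""
    m = SYSLOG_RE.match(doc["message"])
    if m is None:
        raise ValueError("not a syslog frame")
    doc["syslog"] = {"priority": int(m["pri"]), "timestamp": m["ts"], "host": m["host"]}
    doc["message"] = m["rest"]  # now the raw log line from the original file
    return doc


def parse_apache(doc):
    """Layer 3 (stand-in for apache_httpd/parsing.punch): extract individual fields."""
    m = APACHE_RE.match(doc["message"])
    if m is None:
        raise ValueError("not an apache access log line")
    doc["apache"] = {
        "client": m["client"],
        "method": m["method"],
        "path": m["path"],
        "status": int(m["status"]),
    }
    return doc


# Hypothetical record as an input node might hand it over.
record = {
    "remote_ip": "10.0.0.5",
    "received_at": "2024-01-12T10:15:33Z",
    "data": '<134>Jan 12 10:15:32 web01 192.0.2.10 - - '
            '[12/Jan/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512',
}

doc = record
for stage in (input_stage, parse_syslog_header, parse_apache):
    doc = stage(doc)
```

Because the syslog layer knows nothing about Apache, the same `parse_syslog_header` stage could be reused in front of any other device-specific parser, and `parse_apache` could be applied directly to events that arrive without a syslog envelope.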

Errors management

Because there is a "punch node" in the punchline (for the parsing/normalization/enrichment), there is a risk of exception/error in the punchlet code.

This means that a specific output queue (in our example, the reftenant-errors topic) has to be designated to collect these unexpected documents so that they are not lost, and so that someone can later identify the problem and request changes to the source device configuration or to the device type discovery mechanisms.

This is very important, because even a slight unexpected change in the input log format may break a parsing expression that was developed against the standard format.
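The principle can be sketched as follows. This is plain Python, not the punchline's actual error-handling mechanism: the two lists stand in for the nominal output topic and the errors topic, and `parse_kv` is a hypothetical parser written for the example. The point is only that a failing document is routed, together with the error, instead of being dropped.

```python
def parse_kv(line):
    """Hypothetical parser: expects space-separated key=value pairs."""
    return dict(pair.split("=", 1) for pair in line.split())


def process(line, parser, output_queue, errors_queue):
    """Route the parsed document to the output queue, or the raw line to the errors queue."""
    try:
        doc = parser(line)
    except Exception as exc:
        # Keep the original data and the reason, so the problem can be investigated later.
        errors_queue.append({"raw": line, "error": repr(exc)})
    else:
        output_queue.append(doc)


output_queue = []   # stand-in for the nominal output topic
errors_queue = []   # stand-in for the errors topic (reftenant-errors in this example)

for line in ["user=alice action=login", "unexpected-format-line"]:
    process(line, parser=parse_kv, output_queue=output_queue, errors_queue=errors_queue)
```

After this runs, the well-formed line is in `output_queue` as a parsed document, and the malformed one is in `errors_queue` with its raw content preserved.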

Performance / CPU consumption / Scalability

The performance of the processing is bounded by the throughput of the least efficient processing punchlet. So if too many processors/threads (executors) have to be allocated to the punch node, one of the punchlets is probably not optimized and should be refactored and unit-tested for CPU/processing-time efficiency.

Scalability can be achieved at small scale (up to several thousand EPS) by increasing the 'executors' count of the punchlet node. The punchline's internal mechanisms load-balance the processing of tuples between these multiple instances. The number of threads actually used is bounded both by the executors count for the node and by the maximum number of threads allowed to the JVM.

Of course, if reaching the required overall throughput takes more threads/CPUs than a single JVM can have (because of VM/server sizing), scalability must go one step further: running multiple instances of the same punchline.

Running more than one instance of a processing punchline means that the instances consume the same input Kafka queue. For these instances to work concurrently, they 'share' the input flow: each one is assigned part of the input topic (Kafka queue). This is supported by Kafka's 'partitions' concept. The source Kafka topic must have at least as many partitions as the number of Kafka consumer instances (input nodes) you want across the processing instances. To scale up the processing stage, you can therefore increase the number of partitions in the input topic, then increase the number of processing punchlines up to that number of partitions.
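As a rough sizing aid, the reasoning above can be turned into a back-of-the-envelope calculation. The sketch below is plain Python with made-up per-event CPU costs; it assumes that the punchlets run one after the other on each event inside a single executor thread, and it ignores any overhead outside the punchlets. Real figures have to come from measurements on your own punchlets.

```python
# Hypothetical measured CPU cost per event, in microseconds, for each punchlet.
punchlet_cost_us = {
    "input.punch": 20,
    "parsing_syslog_header.punch": 30,
    "apache_httpd/parsing.punch": 150,
}


def executors_needed(target_eps, cost_us):
    """Smallest executor count able to sustain target_eps, under the stated assumptions."""
    total_us = sum(cost_us.values())              # CPU time spent on one event
    busy_us_per_second = target_eps * total_us    # CPU time needed per wall-clock second
    return -(-busy_us_per_second // 1_000_000)    # integer ceiling division


def costliest_punchlet(cost_us):
    """The punchlet to profile and optimize first."""
    return max(cost_us, key=cost_us.get)


needed = executors_needed(12_000, punchlet_cost_us)
bottleneck = costliest_punchlet(punchlet_cost_us)
```

With these illustrative numbers, one event costs 200 microseconds, so one executor handles about 5,000 EPS and 12,000 EPS requires 3 executors. Most of that cost comes from the Apache parsing punchlet, which is therefore the first candidate for optimization.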
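The constraint between partition count and instance count can be illustrated with a simplified model. This sketch is plain Python and does not call Kafka; it distributes partitions round-robin, whereas Kafka's real assignment strategies are more elaborate. It relies only on the rule that, within one consumer group, a partition is consumed by a single consumer at a time. Function names are hypothetical.

```python
def assign_partitions(partition_count, instance_count):
    """Distribute partition ids over instance ids, round-robin (simplified model)."""
    assignment = {instance: [] for instance in range(instance_count)}
    for partition in range(partition_count):
        assignment[partition % instance_count].append(partition)
    return assignment


def idle_instances(assignment):
    """Instances that received no partition and therefore process nothing."""
    return [instance for instance, parts in assignment.items() if not parts]


balanced = assign_partitions(partition_count=4, instance_count=2)
oversized = assign_partitions(partition_count=4, instance_count=6)
```

With 4 partitions and 2 instances, each instance consumes 2 partitions. With 4 partitions and 6 instances, `idle_instances(oversized)` reports two instances with nothing to consume, which is why the partition count sets the upper limit on useful processing instances.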

!!! Note Increasing the number of partitions can be done without data loss in the Kafka topic, but may require a restart of the PRODUCER punchline that writes to the topic, so that written data is also balanced across the additional partitions. Please refer to HOWTO alter existing kafka topics.

Apache processor Punchline example

tenants/reftenant/channels/apache_httpd/processor.hjson