
Log Indexer Punchline

Refer to the central log management overview

Figure: Reference central site Pipelines

Key design highlights

Additional fields

To use the Elasticsearch indexing API, the Elastic output node must write JSON documents. Here, the node is configured to use a JSON document prepared by the processing stage (see processor punchlines); this document arrives as a JSON string in the 'log' field of the input Kafka topic.

The Elasticsearch output node can enrich the JSON document on the fly with additional root-level fields if needed ('additional_document_value_fields' setting). Here, we use this capability (sketched in the example below) to:

  • add the raw log in a 'message' field of the JSON document
  • add an indexation timestamp in a '@timestamp' field. Note that this is not the true 'event time', which is stored in the normalized 'obs.ts' field. So when viewing logs in Kibana, the user has to be aware of the meaning of each timestamp field and use the appropriate one for his/her task (this choice is mainly made when designing the 'index patterns' on which the user dashboards are built).

    Storing the indexation time is useful to keep a clear trace of the difference between event time and indexing time, when troubleshooting the overall latency of the collection chain or problems in the correlation rules.
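As an illustration, a minimal sketch of the corresponding Elastic output node settings could look as follows. The 'document_json_field' setting name, the 'raw_log' source field and the exact 'additional_document_value_fields' entry syntax are indicative assumptions; refer to the Elasticsearch output node reference documentation for the authoritative settings.

```hjson
{
  type: elasticsearch_output
  settings: {
    per_stream_settings: [
      {
        stream: logs
        # the parsed document produced by the processing stage arrives
        # as a JSON string in the 'log' field of the Kafka topic
        document_json_field: log
        additional_document_value_fields: [
          # copy the raw log into a 'message' root-level field
          # ('raw_log' is a hypothetical source field name)
          { type: tuple_field, document_field: message, tuple_field: raw_log }
          # add the indexation timestamp in '@timestamp'
          # (the event time itself stays in the normalized 'obs.ts' field)
          { type: date, document_field: "@timestamp", format: iso }
        ]
        index: { type: daily, prefix: reftenant-events- }
      }
    ]
  }
}
```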

Multiple streams and _ppf_metrics special stream

Like some other Punch nodes, the Elasticsearch output node is able to process multiple input streams without requiring separate output node instances, which is useful to reduce memory consumption. This requires providing a separate configuration, in the 'per_stream_settings' section, for each stream to which the Elastic output node has subscribed (see the sketch below).

Note that '_ppf_metrics' is a special internal stream used to circulate latency-measuring test frames. These frames are automatically transformed into metrics that are reported through the metrics reporters mechanism, so there is no need to provide an output configuration for this specific stream in the 'per_stream_settings'.
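For instance, an Elastic output node subscribed to both a business stream and the '_ppf_metrics' stream only needs a 'per_stream_settings' entry for the business stream. The stream and component names below are illustrative:

```hjson
{
  type: elasticsearch_output
  settings: {
    per_stream_settings: [
      # one entry per subscribed business stream;
      # no entry is needed for the '_ppf_metrics' stream
      { stream: logs, document_json_field: log, index: { type: daily, prefix: reftenant-events- } }
    ]
  }
  subscribe: [
    { component: kafka_input, stream: logs }
    # latency test frames: automatically turned into metrics
    { component: kafka_input, stream: _ppf_metrics }
  ]
}
```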

Common or split output channels

When designing your production channels, you will have to choose between dedicating separate channels to index the different kinds of logs (Apache, Windows...) into Elasticsearch, or having a common channel (possibly with a single 'output' Kafka topic).

This is a tradeoff between:

  • the risk that a 'flood' in one of the log source types induces a heavy backlog on ALL log source types (shared channel design)
  • the risk of excessive memory consumption and channel management overhead (dedicated channels design)

In particular, the 'output' Kafka topic(s) are shared between all outputs (archiving, forwarding to a dual site, indexing), so multiplying the Kafka topics implies increasing the number of channel components needed to read them all.

This design and its tradeoffs have to be thought out globally, for all output needs at the same time.

It is sometimes a good choice to have dedicated output indexer channels for 'big/verbose' log kinds/technologies, and a common output indexer channel for multiple small ones.

Having multiple dedicated output indexer channels also allows optimizing the throughput of each log type with distinct settings, especially when the targeted Elasticsearch indices are separate for better storage efficiency (different field mappings for different log types).

When the indexer is shared ('mutualized') but the different log types must be kept in separate Elasticsearch indices (for indexing/querying efficiency), the index name has to be computed beforehand in the processor stage and conveyed to the indexing punchline through a separate field (in our reference configuration, this field is named 'es_index').
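A possible sketch of this pattern is given below. The exact syntax to take the index name from a stream field is an assumption to be checked against the Elasticsearch output node reference documentation:

```hjson
{
  type: elasticsearch_output
  settings: {
    per_stream_settings: [
      {
        stream: logs
        document_json_field: log
        # the target index name was pre-computed by the processor punchline
        # and travels in the 'es_index' field of the Kafka topic
        # (the 'tuple_field' index type shown here is an assumption)
        index: { type: tuple_field, tuple_field: es_index }
      }
    ]
  }
  subscribe: [
    # the Kafka input node must publish the 'es_index' field along with the document
    { component: kafka_input, stream: logs }
  ]
}
```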

Scalability / Performance

For processor punchlines, the key to performance is running multiple node instances, to harness the computing power of multiple CPU threads. BUT here we are not computing anything. The performance challenge is to leverage Elasticsearch's parallel processing capacity by issuing multiple asynchronous, concurrent REST indexing requests without waiting for the preceding one to finish (otherwise, the end-to-end latency would reduce the throughput).

The Elasticsearch output node sends these requests in parallel by design, so there is no need to run multiple instances/executors of this output node.

The maximum number of concurrent requests is the 'topology.max.spout.pending' setting (in the global settings section of the punchline) divided by the 'batch_size' (maximum number of logs per request). The optimum setting is usually several thousand logs per batch (say 3000 or 5000), but it is influenced by the log size.
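For example, with the illustrative values below (the setting names are those cited above; the surrounding punchline skeleton is simplified):

```hjson
{
  # global settings section of the punchline:
  # at most 30000 logs may be pending (in flight) at any time
  settings: {
    "topology.max.spout.pending": 30000
  }
  dag: [
    {
      type: elasticsearch_output
      settings: {
        # maximum number of logs per bulk indexing request
        batch_size: 3000
      }
    }
  ]
}
# => at most 30000 / 3000 = 10 concurrent bulk indexing requests
```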

Warning

Be careful when increasing 'topology.max.spout.pending': sending too many batches at the same time stops being efficient at some point.

Worse, this will lengthen the time a log waits in the Elasticsearch input queues, up to the threshold at which the log is considered 'in timeout' by the punchline spouts (usually 1 minute), leading to replays of the same log (even if it was actually indexed the first time).

Experimenting is the best way to find the sweet spot where indexing is fast, and the overall processing time of the tuple (see Elastic Output node metrics) is less than 1 or 2 seconds.

Errors management

In production, we want to avoid log loss. Errors can occur when trying to index logs into Elasticsearch (either because of log anomalies, or because of temporary Elasticsearch unavailability).

Indexing rejection by Elasticsearch

The 'reindex_failed_documents: true' setting prevents log loss. With this Elasticsearch output setting, if an indexing error occurs (a settings sketch is given after this list):

  • Field type mapping exception (which will never allow indexing the log, even after retries, because the faulty document stays the same) ==> the log is 'wrapped' as an error document (with a string field containing the whole document), so that it can still be injected into Elasticsearch (of course, not indexed as nominal documents would be).
  • Other Elasticsearch errors ==> the punchline is 'stuck', retrying the indexing until the error condition is resolved (e.g. Elasticsearch is available again).
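A minimal sketch of the relevant output node settings follows; the 'error_index' setting name and its syntax are assumptions to be checked against the Elasticsearch output node reference documentation:

```hjson
{
  type: elasticsearch_output
  settings: {
    # retry or wrap failed documents instead of dropping them
    reindex_failed_documents: true
    # where wrapped error documents are injected (setting name is an assumption)
    error_index: { type: daily, prefix: reftenant-events-indexation-errors- }
    per_stream_settings: [
      # ... nominal stream settings as shown above ...
    ]
  }
}
```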

Processing errors indexing

Of course, when a processing error occurs in the processing stage, an 'error' document is produced there instead of the normal parsed JSON document. So an Elasticsearch indexer punchline has to be configured to read these errors from the associated Kafka output topic (cf. processor punchlines).

Because error documents and nominal parsed logs do not have the same associated metadata, the fields configuration in the indexer punchline is slightly different (see the 'published' fields of the Kafka input node, which match the standard built-in metadata fields of parsing error documents, as described in the Punch processing node reference documentation).
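As an illustration, the Kafka input of the errors indexer could publish fields along these lines. The topic name is hypothetical and the '_ppf_*' field names must be checked against the processing node reference documentation cited above:

```hjson
{
  type: kafka_input
  settings: {
    topic: reftenant-processing-errors
    # ... brokers, offset strategy ...
  }
  publish: [
    {
      stream: logs
      fields: [
        # standard built-in metadata fields of parsing error documents
        _ppf_id
        _ppf_timestamp
        _ppf_error_message
        _ppf_error_document
      ]
    }
  ]
}
```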

Parsed logs indexer Punchline example

Here is an example of a 'mutualized multi-log-technologies' indexing punchline, where the index name has been pre-computed by the processor punchline.

tenants/reftenant/channels/indexer/parsed_logs_indexer.hjson

Processing errors indexer Punchline example

Here is an example of the special case 'processing errors' indexing punchline.

Please refer to Processing errors indexing or to the inline comments for explanations.

tenants/reftenant/channels/indexer/errors_indexer.hjson