
PunchPlatform Supervision (by Nagios)

This chapter explains :

  • the available metrics to monitor the system performance and capacity
  • how to monitor a platform from an external supervision system, such as Nagios

Supervision

This part defines the resources that the supervision system must monitor.

Overview of resources

To ensure that the PunchPlatform system is working, supervision must at least target :

  • The running status of a number of key services (from a Linux point of view)
  • The system-level resource consumption of all PunchPlatform servers (disk space, CPU/RAM usage)
  • The health status indicators published by the PunchPlatform admin service

Optionally, supervision should also target :

  • Backlog levels from Elasticsearch (Admin)
  • Supervisord error status
  • Pacemaker error status
  • Elasticsearch nodes count through the REST API (see the example below)
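
For the Elasticsearch node count, a simple probe can call the standard cluster health API and compare the reported number_of_nodes with the expected value. This is a minimal sketch: localhost:9200 and the expected count of 3 nodes are assumptions to adapt to your platform.

$ curl -s 'localhost:9200/_cluster/health?filter_path=number_of_nodes,status'
{"number_of_nodes":3,"status":"green"}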

Key Services to watch

On the following PunchPlatform servers, supervision must ensure that the services managed by supervisord are running :

  • Elasticsearch servers

    $ supervisorctl status elasticsearch | grep RUNNING

  • Storm slaves of all clusters (LTR, LMR, STO)

    $ supervisorctl status storm-supervisor | grep RUNNING

  • Storm masters of all clusters (usually on LMC and LTRs)

    $ supervisorctl status storm-nimbus | grep RUNNING
    $ supervisorctl status storm-ui | grep RUNNING

  • Zookeeper servers (usually on KAF and LTRs)

    $ supervisorctl status zookeeper | grep RUNNING

  • Kafka servers

    $ supervisorctl status kafka-<cluster_name> | grep RUNNING

    Note

    Take a look at punchplatform.properties or the architecture documents. Usual cluster names are "front" and "back".

  • Kibana servers

    $ ! supervisorctl status | grep kibana | grep -v RUNNING

  • Ceph servers

    $ ! sudo systemctl | grep 'ceph-osd-main@' | grep -v running
    $ ! sudo systemctl | grep 'ceph-osd-mon@' | grep -v running

On all cluster servers that use a Virtual IP (LTR nodes, KIB nodes, LMC admin nodes, Grafana nodes), supervision must ensure that the pacemaker and corosync services are active :

$ sudo service corosync status
$ sudo service pacemaker status
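
To plug these checks into Nagios (for example through NRPE), a thin wrapper around supervisorctl can return the usual plugin exit codes. This is only a minimal sketch, assuming the monitoring user is allowed to run supervisorctl and that the service name is passed as the first argument; adapt it to your packaging:

#!/bin/bash
# check_supervisor.sh <service> : print a status line and use Nagios exit codes
SERVICE="$1"
STATE=$(supervisorctl status "$SERVICE" 2>/dev/null | awk '{print $2}')
if [ "$STATE" = "RUNNING" ]; then
    echo "OK - $SERVICE is RUNNING"
    exit 0
else
    echo "CRITICAL - $SERVICE is ${STATE:-UNKNOWN}"
    exit 2
fi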

PunchPlatform Health indicator API

To monitor the platform health using a dedicated tool (e.g. Nagios, Centreon, Zabbix, ...), the PunchPlatform exposes a JSON API. We keep an Elasticsearch resource updated with the latest platform health state.

This resource is located at <es_url>/punchplatform/api/v2. For example, using curl, you can fetch it with:

$ curl -s localhost:9200/punchplatform/api/v2

The returned document will look like this one:

{
    "_index": "punchplatform",
    "_type": "api",
    "_id": "v2",
    "_version": 53,
    "found": true,
    "_source": {
        "@timestamp": "2018-05-23T14:04:02.584Z",
        "platform": {
            "health": "yellow"
        },
        "storm": {
            "health": "green",
            "clusters": {
                "main": {
                    "health": "green",
                    "cluster": {
                        "health": "green",
                        "details": {
                            "slotsFree": 30,
                            "memAssignedPercentUtil": "0.0",
                            [...]
                        }
                    },
                    "nimbus": {
                        "health": "green",
                        "details": {
                            "status": "Leader",
                            "port": 6627,
                            [...]
                        }
                    },
                    [...]
                }
            }
        },
        "elasticsearch": {
            "health": "yellow",
            "clusters": {
                "es_search": {
                    "health": "yellow",
                    "details": {
                        "number_of_pending_tasks": 0,
                        "cluster_name": "es_search",
                        [...]
                    }
                }
            }
        },
        "zookeeper": {
            "health": "green",
            "clusters": {
                "common": {
                    "health": "green",
                    "details": {
                        "zk_packets_sent": "164687",
                        "zk_max_latency": "92",
                        [...]
                    }
                }
            }
        },
        [...]
    }
}

At the top level, @timestamp is the last update time in ISO format. Each other key describes a PunchPlatform component declared in the punchplatform.properties file (Kafka, Storm, Elasticsearch, ...).

The "health" keys can take 3 different values:

  • green - everything is OK
  • yellow - non-nominal mode, a configuration problem is detected or some nodes are down, but the service is still available
  • red - critical failure, the service is down

This document represents the complete platform health. If you only need a subsection of it (say, to only monitor Elasticsearch), feel free to parse it. For example, curl works pretty well with jq:

$ curl -s localhost:9200/punchplatform/api/v2 | jq -rc '._source.elasticsearch.health'
yellow

$ curl -s localhost:9200/punchplatform/api/v2 | jq -rc '._source.elasticsearch'
{
  "health": "yellow",
  "clusters": {
    "es_search": {
      "health": "yellow",
      "details": {
        "number_of_pending_tasks": 0,
        "cluster_name": "es_search",
        "active_shards": 5,
        "active_primary_shards": 5,
        "unassigned_shards": 1,
        "delayed_unassigned_shards": 0,
        "timed_out": false,
        "relocating_shards": 0,
        "initializing_shards": 0,
        "task_max_waiting_in_queue_millis": 0,
        "number_of_data_nodes": 1,
        "number_of_in_flight_fetch": 0,
        "active_shards_percent_as_number": 83.33333333333334,
        "status": "yellow",
        "number_of_nodes": 1
      }
    }
  }
}

Note

To learn more about Jq, a lightweight and flexible command-line JSON processor, refer to the official documentation.
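
Putting the two previous snippets together, a Nagios-compatible check can map the platform health onto the standard plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). This is only a sketch, assuming curl and jq are available on the monitoring host and that localhost:9200 points to the admin Elasticsearch cluster:

#!/bin/bash
# check_punchplatform_health.sh : map the global platform health to Nagios exit codes
HEALTH=$(curl -s localhost:9200/punchplatform/api/v2 | jq -r '._source.platform.health')
case "$HEALTH" in
  green)  echo "OK - platform health is green"       ; exit 0 ;;
  yellow) echo "WARNING - platform health is yellow" ; exit 1 ;;
  red)    echo "CRITICAL - platform health is red"   ; exit 2 ;;
  *)      echo "UNKNOWN - could not read platform health" ; exit 3 ;;
esac

The same pattern applies to any subsection of the document, for example '._source.elasticsearch.health' to supervise only Elasticsearch.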

(Optional) Backlog Metrics Supervision

The "backlog" is the amount of messages that are stored in a Kafka topic but have not yet been processed by the consuming layer(s) of the PunchPlatform channels.

The PunchPlatform built-in channel health monitoring is based on latencies measured inside the log channels. Backlog growth is therefore automatically monitored for channels whose autotest latency control paths are configured to encompass the Kafka buffering (i.e. with at least one latency control path with a start point at a spout somewhere ABOVE the Kafka layer, and an end point somewhere AFTER the Kafka layer). This configuration SHOULD be done for each channel, in order to automatically detect an unusual backlog (which might mean insufficient processing capacity configured for this channel, or a flood of incoming messages on this specific channel).

Nevertheless, it is additionally possible to configure a separate (not PunchPlatform-provided) alerting tool to raise alerts in case of a high message backlog: by sending REST requests to the Elasticsearch REST API (using the metrics backend Virtual IP), the supervision system can retrieve mean values of the backlogs of any channel that includes a Kafka spout.
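
As an illustration, such a tool could average the fetch backlog of a given channel over the last few minutes with a standard Elasticsearch aggregation. The sketch below relies on the "metrics-*" index pattern, the "ts" time field and the "tags.pp.channel" tag documented below; the channel name "mychannel" is an example, and the name of the numeric value field ("value" here) is an assumption to check against your metrics mapping:

$ curl -s -H 'Content-Type: application/json' 'localhost:9200/metrics-*/_search' -d '{
  "size": 0,
  "query": { "bool": {
    "must":   { "query_string": { "query": "name:backlog.fetch AND tags.pp.channel:mychannel" } },
    "filter": { "range": { "ts": { "gte": "now-5m" } } }
  }},
  "aggs": { "mean_backlog": { "avg": { "field": "value" } } }
}'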

PunchPlatform Metrics

Metrics Introduction

Metrics categories

The PunchPlatform metrics system collects three families of metrics :

  • The channel-level applicative metrics : computed by Storm components inside the topologies
  • Java Virtual Machine metrics : computed by a generic plugin. The same metrics are reported by each worker and provide information about garbage collection, CPU usage, etc.
  • Operating System-level metrics : collected by agents installed on the PunchPlatform application servers : Kafka, Storm, Elasticsearch, metrics backend and PunchPlatform admin servers.

Only the channel-level metrics can be activated and configured by the user, in the channel configuration files.

Channel-level applicative metrics

All these metrics are storm-component-related.

Please refer to the spouts and bolts documentation for a detailed list of published metrics and associated names/context tags.

PunchPlatform Storm Java VM metrics

All specific PunchPlatform topology components run inside Storm 'worker' Java virtual machines, or in a "local mode" JVM on dedicated servers.

The following metrics groups are published for each of these JVMs :

  • memory (especially interesting is the ratio memory.total.used/memory.total.max)
  • pools
  • threadstates
  • gc

These groups are published under the following metrics path :

punchplatform.jvm.storm.<topology_id>.<jvm_process_id>@<jvm_host>

Where :

  • topology_id is the STORM topology id for jvms running tasks belonging to a storm topology. The topology id begins with _
  • jvm_process_id is the linux process id of the JVM
  • jvm_host is the linux hostname on which the JVM runs

Overview

The PunchPlatform comes with in-built application and system level monitoring. The collected metrics are stored in an Elasticsearch backend, on top of which you can design Grafana and Kibana dashboards.

With Elasticsearch, searching and retrieving metrics based on a name pattern is not the most efficient approach. Instead, each metric has a (short) name, a value, and a number of tags : tenant, cluster, topology id, etc. Metrics can be requested using filters or exact values of some of these tags, which is both efficient and flexible.

With that design, the name of a metric identifies a category grouping all metrics sharing the same nature. Tags provide additional contextual information, typically to characterize the point of measurement.

The elasticsearch query is of the form "name:uptime AND tags.storm.component_id:punch_bolt AND tags.storm.topology_name:parsing AND tags.pp_platform_id:punchplatform AND tags.pp.channel:mychannel".
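
For example, such a query can be sent directly to the metrics index through the standard _search endpoint. A sketch (the "metrics-*" index pattern and the "ts" time field are the ones given in the note below; the component and channel values are examples):

$ curl -s -H 'Content-Type: application/json' 'localhost:9200/metrics-*/_search' -d '{
  "query": { "query_string": { "query": "name:uptime AND tags.storm.component_id:punch_bolt AND tags.pp.channel:mychannel" } },
  "sort": [ { "ts": "desc" } ],
  "size": 1
}'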

Elasticsearch has been selected as the PunchPlatform metrics backend because :

  • It is resilient and scalable
  • It allows additional tags to be attached to metric values over time, which eases dashboarding without negative impact on existing dashboards
  • It reduces the number of COTS needed for a PunchPlatform deployment (and especially, it is already included in the standalone deployment of the punchplatform)

Naming convention for metrics context tags

All PunchPlatform metrics context tags are stored as subfields of the "tags" field of the metrics documents. When tag subfield names are given in the PunchPlatform documentation, the '.' character denotes nested subfields.

PunchPlatform Metrics

Note

The elasticsearch template for punchplatform metrics is "metrics-*" and uses the time field name "ts".

Storm Metrics

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
storm.tuple.fail failed tuples rate [Meter] Tuples/second context_storm, context_channel
storm.tuple.ack acked Tuples rate [Meter] Tuples/second context_storm, context_channel
storm.tuple.rtt Tuple traversal time [Histogram] milliseconds context_storm, context_channel
storm.tuple.pending pending Tuples [Counter] Tuples context_storm, context_channel
storm.tuple.length average Tuple length [Histogram] bytes context_storm, context_channel
storm.uptime indicate the process is still alive [Counter] second context_storm, context_channel
storm.tuple.eps maximum events-per-second rate, computed over 500 ms windows during the last 30 s (default values) [TimeValue] Tuples/second context_storm, context_channel

Netty Metrics

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
 netty.app.recv decoded bytes received [Counter] bytes context_netty
 netty.raw.recv raw bytes received [Counter] bytes context_netty
 netty.app.sent decoded bytes sent [Counter] bytes context_netty
 netty.raw.sent raw bytes sent [Counter] bytes context_netty
 netty.compression.recv.ratio compression ratio of received data [Gauge] ratio context_netty
 netty.compression.sent.ratio compression ratio of sent data [Gauge] ratio context_netty

Kafka Spout Metrics

The Kafka Spout publishes several metrics, some of them related to the various backlogs. Here is a quick view of the three published backlogs.

[Figure: overview of the fetch, replayable and commit backlogs]

The so-called fetch backlog ([backlog.fetch]) is the one that tells you if your consumer (aka spout) is lagging behind your producer(s). The replayable backlog tells you how many messages you can potentially replay; it also has an important operational meaning.

The commit backlog is more informational: it gives you an idea of how many messages you would replay should you restart a topology.

Info

The metrics published by the spout stop being published (of course) whenever you stop your topology. However, the punchplatform also publishes the lag of all known topic/partitions for all defined consumer groups from an external monitoring service, so that you never lose visibility on your backlogs.

Here is the detailed list of metrics published by the KafkaSpout.

  • backlog.commit: [Gauge] long
    • the commit backlog expresses the number of messages that would be re-read in case of restart. It measures the gap between the latest saved committed offset and the latest offset.
    • This metric is meaningful only with the "last_committed" strategy.
  • backlog.fetch: [Gauge] long
    • the fetch backlog expressed as the number of messages not yet fetched by the spout. It measures the gap between the latest fetched offset and the latest offset.
  • backlog.replayable: [Gauge] long
    • the backlog expressed as the greatest number of messages that can possibly be re-read from this partition. This is an indication of the messages you can still replay from Kafka before they are definitively discarded.
  • commit.latency: [Timer] ms
    • the time it takes to perform an offset commit for this partition. This gives an idea of how fast the Kafka broker handles commits.
  • msg.rate: [Meter] per partition read/sec
    • this rate measures the number of effective reads from a partition
  • msg.size: [Histogram] size
    • the average message size
  • offset.ack.rate: [Meter] acks/sec
    • the rate of offset acknowledgement
  • offset.fail.rate: [Meter] acks/sec
    • the rate of offset failure
  • offset.earliest: [Gauge] long
    • the earliest offset for this partition
  • offset.latest: [Gauge] long
    • the latest offset for this partition
  • offset.committed: [Gauge] long
    • the committed offset for this partition
  • time.current: [Gauge] long
    • the time associated to the currently read message. That value is a milliseconds epoch unix timestamp.
  • time.delay: [Gauge] long
    • the time difference in milliseconds between now and time.current. Visualising this gauge gives you an easy view of how late in time your consumer is.

Note

All these metrics are published per topic, per partition and per consumer group.
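
For instance, the current consumer delay on a given topic can be read back from the latest time.delay gauge. A sketch (the topic name is an example; the "tags.kafka.topic" subfield is described in the Kafka Context section below):

$ curl -s -H 'Content-Type: application/json' 'localhost:9200/metrics-*/_search' -d '{
  "query": { "query_string": { "query": "name:time.delay AND tags.kafka.topic:mytopic" } },
  "sort": [ { "ts": "desc" } ],
  "size": 1
}'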

Syslog Spout Metrics

The SyslogSpout publishes the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
 syslog.server.blocked_by_queue_full_ns time elapsed in reception thread while waiting due to input queue full (may cause message loss if UDP) [Meter] nanoseconds  context_storm, context_channel

File Spout Metrics

The FileSpout publishes the following metrics :

Elasticsearch Spout Metrics

The ElasticsearchSpout publishes the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
elasticsearch.spout.ack.uptotimestamp all documents/logs have been extracted, up to this instant [Meter] milliseconds since 01/01/1970 00:00 (UTC) context_job, context_storm, context_elasticsearch

elasticsearch.spout.fullyacked.count number of logs/documents which have been successfully extracted and processed/acknowledged by the topology since the beginning of the extract job, and which will not be replayed in case of topology failure. Processed logs/documents are counted only once, even if there have been earlier failures/retries during the job lifetime. [Gauge] number of documents context_job, context_storm, context_elasticsearch

elasticsearch.spout.fetch.rate number of documents extracted from elasticsearch [Meter] count and rate context_job, context_storm, context_elasticsearch

elasticsearch.spout.fetch.timeslice.startms beginning of the time slice for which documents are currently being extracted from Elasticsearch [Meter] milliseconds since 01/01/1970 00:00 (UTC) context_job, context_storm, context_elasticsearch


Lumberjack Spout Metrics

The LumberjackSpout publishes the common Storm metrics and Netty metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

netty.lumberjack.compressed compressed bytes count [Counter] bytes context_storm, context_channel, context_netty

netty.lumberjack.uncompressed uncompressed bytes count [Counter] bytes context_storm, context_channel, context_netty

netty.lumberjack.decoded decoded (application) bytes count [Counter] bytes context_storm, context_channel, context_netty


Http Spout Metrics

The HttpSpout publishes the common Storm metrics described above.

Admin Metrics

The Admin service publishes the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
autotest_latency Latency of channel components [Gauge] long context_channel_autotestlatency

kafka.earliest-available-offset Earliest available offset in Kafka. [Gauge] long context_platform, context_kafka_partition

kafka.latest-available-offset Latest available offset in Kafka. [Gauge] long context_platform, context_kafka_partition

kafka.earliest-available-timestamp Kafka timestamp of the earliest message. [Gauge] long context_platform, context_kafka_partition

kafka.latest-available-timestamp Kafka timestamp of the latest message. [Gauge] long context_platform, context_kafka_partition

kafka.consumer.lag.message Difference between the latest offset and the current consumer offset. [Gauge] long context_platform, context_kafka_partition_consumer

kafka.consumer.lag.time Difference between the timestamp of the latest message and the timestamp of the current message at the consumer offset. [Gauge] long context_platform, context_kafka_partition_consumer


Kafka Bolt Metrics

The KafkaBolt publishes the common Storm metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

kafka.bolt.messages.bytes average message size [Counter] bytes context_storm, context_channel, context_kafka

kafka.bolt.messages.batched average number of messages per batch [Histogram] messages context_storm, context_channel, context_kafka

kafka.bolt.messages.rate rate of messages written to Kafka [Meter] message/second context_storm, context_channel, context_kafka


Syslog Bolt Metrics

The SyslogBolt publishes the common Storm metrics and Netty metrics described above.

Lumberjack Bolt Metrics

The LumberjackBolt publishes the common Storm metrics and Netty metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

netty.lumberjack.compressed compressed bytes count [Counter] bytes context_storm, context_channel, context_netty

netty.lumberjack.decoded application bytes count [Counter] bytes context_storm, context_channel, context_netty

netty.lumberjack.uncompressed uncompressed bytes count [Counter] bytes context_storm, context_channel, context_netty


Archive Processor Bolt Metrics

When the "write_to_objects_storage" publication is activated, the Archive Processor Bolt publishes the common Storm metrics described above, plus the following metrics :


NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT


ceph.cluster.kbytes.used storage space used by the cluster (including management data) [Gauge] instant kilobytes count context_storm, context_channel, context_ceph

ceph.cluster.kbytes.free unused storage space available for the cluster [Gauge] instant kilobytes count context_storm, context_channel, context_ceph

ceph.cluster.objects.stored number of objects currently stored in the cluster [Gauge] instant count context_storm, context_channel, context_ceph

ceph.pool.kbytes.used storage space used specifically by this object pool in the cluster [Gauge] instant kilobytes count context_storm, context_channel, context_ceph

ceph.pool.objects.stored number of objects currently stored in the object pool in the cluster [Gauge] instant count context_storm, context_channel, context_ceph

ceph.pool.objects.degraded number of objects with missing replica [Gauge] instant count context_storm, context_channel, context_ceph

ceph.pool.objects.unfound number of objects with unknown placement [Gauge] instant count context_storm, context_channel, context_ceph

ceph.pool.objects.missingonprimary number of objects missing in primary [Gauge] instant count context_storm, context_channel, context_ceph

ceph.partition.objects.stored number of objects currently stored in the partition of the topic [Gauge] instant count context_storm, context_channel, context_ceph_partition

ceph.partition.tuples.stored number of tuples currently stored in the partition of the topic [Gauge] instant count context_storm, context_channel, context_ceph_partition

ceph.partition.bytes.stored number of bytes currently stored in the partition of the topic [Gauge] instant bytes count context_storm, context_channel, context_ceph_partition

ceph.partition.uncompressed.bytes.stored number of bytes stored in the partition of the topic (before compression) [Gauge] instant bytes count context_storm, context_channel, context_ceph_partition

ceph.partition.objects.written number and rate of objects written in the topic [Meter] number of objects context_storm, context_channel, context_ceph_partition

ceph.partition.tuples.written number and rate of tuples (documents or logs) written in the topic [Meter] number of tuples context_storm, context_channel, context_ceph_partition

ceph.partition.bytes.written number of bytes written in the partition of the topic (and rate) [Meter] number of bytes context_storm, context_channel, context_ceph_partition

ceph.partition.uncompressed.bytes.written number of bytes written in the partition of the topic, and rate (before compression) [Meter] number of bytes context_storm, context_channel, context_ceph_partition


FileReader Bolt Metrics

The FilesReaderBolt publishes the common Storm metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

reader.files.read files successfully extracted [Meter] integer context_storm, context_channel

reader.files.failure files that were not (or not fully) extracted [Meter] integer context_storm, context_channel

reader.lines.read lines successfully extracted [Meter] integer context_storm, context_channel


Elasticsearch Bolt Metrics

The ElasticsearchBolt publishes the common Storm metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

storm.documents.indexation.rate number of documents accumulated in bulk requests [Meter] integer context_storm, context_channel

storm.errors.indexation.rate number of errors accumulated in bulk requests [Meter] integer context_storm, context_channel


Filter Bolt Metrics

The FilterBolt publishes the common Storm metrics described above, plus the following metrics :

NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT

drop.rate drop rate of filtered logs [Meter] integer context_storm, context_channel

storm.tuple.emit emitted tuples [Meter] tuples/second context_storm, context_channel

storm.tuple.eps maximum events-per-second rate, computed over 500 ms windows during the last 30 s (default values) [TimeValue] tuples/second context_storm, context_channel


Context

All the metrics above are enriched in the Elasticsearch backend with the following "tags" subfields, depending on the context level :

Kafka Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.kafka.cluster the kafka brokers cluster id as configured in punchplatform.properties string

tags.kafka.topic the topic name as listed in the topology settings string


Kafka Partition Context

Extends context_kafka


TAGS SUB FIELD DESCRIPTION UNIT


tags.kafka.partition the partition number number


Kafka Partition Consumer Context

Extends context_kafka_partition


TAGS SUB FIELD DESCRIPTION UNIT


tags.consumer.id the kafka id of the consumer : storm topology id, name of storm component, task id string


Platform Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.pp.platform_id The logical identifier of the containing punchplatform. This is the same as the metrics root prefix used for the ES back end. It is used to differentiate metrics produced by multiple PunchPlatform clusters sharing a same metrics backend. string


Channel Context

Extends context_platform


TAGS SUB DESCRIPTION UNIT FIELD


tags.pp.tenant The name or codename of the tenant, as configured in the channel and topology configuration files string

tags.pp.channel The name of the logs channel, as configured in the channel and topology configuration files string


Channel Autotest Latency Context

Extends context_channel


TAGS SUB FIELD DESCRIPTION UNIT


tags.autotest_latency_path.start Injection node name : <punchplatform>.<tenant>.<channel>.<cluster>.<topology>.<component> string

tags.autotest_latency_path.end Current node name : <punchplatform>.<tenant>.<channel>.<cluster>.<topology>.<component> string


Job Context

Extends context_channel


TAGS SUB DESCRIPTION UNIT FIELD


tags.pp.job_id the unique identifier of the PunchPlatform job associated to the elasticsearch extractor topology string


Storm Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.storm.container_id The logical identifier of the containing storm cluster, as listed in the punchplatform.properties file for topologies started in a cluster, or "local<hostname>" for topologies started in local mode in a single process. string

tags.storm.topology_name The logical name of the topology, as it appears in the topology json configuration file. This is not the complete name used by Storm, which includes a timestamp added at channel/topology initial start time and a unique instance identifier. string

tags.storm.component_id The logical name of the storm component, as it appears in the storm_settings.component field of the spout/bolt subsection of the topology json configuration file. string

tags.storm.component_type The spout/bolt type as stated in the "type" field of this component in the topology json configuration file string

tags.storm.task_id The internal storm component number inside the topology. This is useful to distinguish between spout/bolt instances with the same component_id, which are executed when a storm_settings.executors value higher than 1 has been configured in this storm component subsection of the topology json configuration file integer

tags.hostname The local hostname of the server running the storm component string


Elasticsearch Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.elasticsearch.cluster the name of the elasticsearch cluster from which documents are extracted string

tags.elasticsearch.index the name of the elasticsearch index from which documents are extracted string


Ceph Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.ceph.pool the name of the CEPH object pool string


Ceph Partition Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.ceph.topic the name of the topic string

tags.ceph.partition the partition id within the topic integer


Netty Context


TAGS SUB FIELD DESCRIPTION UNIT


tags.netty.target.host The hostname or address of the host to which data is sent. string

tags.netty.target.port The udp or tcp target port to which data is sent. string

tags.netty.target.protocol Used communication protocol. string


MetricBeat

  • module: system
  • module: kafka
  • module: zookeeper

See Metricbeat documentation for further information.

Note

The elasticsearch template for metricbeat is "metricbeat-*" and uses the time field name "@timestamp".
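
As with the punchplatform metrics, these documents can be queried from the supervision side. A minimal sketch retrieving the latest system module document (the metricset.module field follows the standard Metricbeat document layout; adjust the query if your version or template differs):

$ curl -s -H 'Content-Type: application/json' 'localhost:9200/metricbeat-*/_search' -d '{
  "query": { "term": { "metricset.module": "system" } },
  "sort": [ { "@timestamp": "desc" } ],
  "size": 1
}'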

System module

Core Fields


core Fields system-core contains local CPU core stats.


system.core.id type: long CPU Core number.

system.core.user.pct type: scaled_float & format: percent. The percentage of CPU time spent in user space. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, then the cpu.user_p will be 180%.

system.core.user.ticks type: long The amount of CPU time spent in user space.

system.core.system.pct type: scaled_float & format: percent. The percentage of CPU time spent in kernel space.

system.core.system.ticks type: long The amount of CPU time spent in kernel space.

system.core.nice.pct type: scaled_float & format: percent. The percentage of CPU time spent on low-priority processes.

system.core.nice.ticks type: long The amount of CPU time spent on low-priority processes.

system.core.idle.pct type: scaled_float & format: percent. The percentage of CPU time spent idle.

system.core.idle.ticks type: long The amount of CPU time spent idle.

system.core.iowait.pct type: scaled_float & format: percent. The percentage of CPU time spent in wait (on disk).

system.core.iowait.ticks type: long The amount of CPU time spent in wait (on disk).

system.core.irq.pct type: scaled_float & format: percent. The percentage of CPU time spent servicing and handling hardware interrupts.

system.core.irq.ticks type: long The amount of CPU time spent servicing and handling hardware interrupts.

system.core.softirq.pct type: scaled_float & format: percent. The percentage of CPU time spent servicing and handling software interrupts.

system.core.softirq.ticks type: long The amount of CPU time spent servicing and handling software interrupts.

system.core.steal.pct type: scaled_float & format: percent. The percentage of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix.

system.core.steal.ticks type: long The amount of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix.


Cpu Fields


cpu Fields cpu contains local CPU stats.


system.cpu.cores type: long The number of CPU cores. The CPU percentages can range from [0, 100% * cores].

system.cpu.user.pct type: scaled_float & format: percent. The percentage of CPU time spent in user space. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, then the system.cpu.user.pct will be 180%.

system.cpu.system.pct type: scaled_float & format: percent. The percentage of CPU time spent in kernel space.

system.cpu.nice.pct type: scaled_float & format: percent. The percentage of CPU time spent on low-priority processes.

system.cpu.idle.pct type: scaled_float & format: percent. The percentage of CPU time spent idle.

system.cpu.iowait.pct type: scaled_float & format: percent. The percentage of CPU time spent in wait (on disk).

system.cpu.irq.pct type: scaled_float & format: percent. The percentage of CPU time spent servicing and handling hardware interrupts.

system.cpu.softirq.pct type: scaled_float & format: percent. The percentage of CPU time spent servicing and handling software interrupts.

system.cpu.steal.pct type: scaled_float & format: percent. The percentage of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix.

system.cpu.user.ticks type: long The amount of CPU time spent in user space.

system.cpu.system.ticks type: long The amount of CPU time spent in kernel space.

system.cpu.nice.ticks type: long The amount of CPU time spent on low-priority processes.

system.cpu.idle.ticks type: long The amount of CPU time spent idle.

system.cpu.iowait.ticks type: long The amount of CPU time spent in wait (on disk).

system.cpu.irq.ticks type: long The amount of CPU time spent servicing and handling hardware interrupts.

system.cpu.softirq.ticks type: long The amount of CPU time spent servicing and handling software interrupts.

system.cpu.steal.ticks type: long The amount of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix.


Diskio Fields


diskio Fields disk contains disk IO metrics collected from the operating system.


system.diskio.name type: keyword The disk name. example: sda1

system.diskio.serial_number type: keyword The disk's serial number. This may not be provided by all operating systems.

system.diskio.read.count type: long The total number of reads completed successfully.

system.diskio.write.count type: long The total number of writes completed successfully.

system.diskio.read.bytes type: long & format: bytes. The total number of bytes read successfully. On Linux this is the number of sectors read multiplied by an assumed sector size of 512.

system.diskio.write.bytes type: long & format: bytes. The total number of bytes written successfully. On Linux this is the number of sectors written multiplied by an assumed sector size of 512.

system.diskio.read.time type: long The total number of milliseconds spent by all reads.

system.diskio.write.time type: long The total number of milliseconds spent by all writes.

system.diskio.io.time type: long The total number of milliseconds spent doing I/Os.


FileSystem Fields


filesystem Fields filesystem contains local filesystem stats.


system.filesystem.available type: long & format: bytes. The disk space available to an unprivileged user in bytes.

system.filesystem.device_name type: keyword The disk name. For example: /dev/disk1

system.filesystem.mount_point type: keyword The mounting point. For example: /

system.filesystem.files type: long The total number of file nodes in the file system.

system.filesystem.free type: long & format: bytes. The disk space available in bytes.

system.filesystem.free_files type: long The number of free file nodes in the file system.

system.filesystem.total type: long & format: bytes. The total disk space in bytes.

system.filesystem.used.bytes type: long & format: bytes. The used disk space in bytes.

system.filesystem.used.pct type: scaled_float & format: percent. The percentage of used disk space.


Fsstat Fields


fsstat Fields system.fsstat contains filesystem metrics aggregated from all mounted filesystems, similar to what df -a prints out.


system.fsstat.count type: long Number of file systems found.

system.fsstat.total_files type: long Total number of files.

system.fsstat.total_size.free type: long & format: bytes. Total free space.

system.fsstat.total_size.used type: long & format: bytes. Total used space.

system.fsstat.total_size.total type: long & format: bytes. Total space (used plus free).


Load Fields


load Fields Load averages.


system.load.1 type: scaled_float Load average for the last minute.

system.load.5 type: scaled_float Load average for the last 5 minutes.

system.load.15 type: scaled_float Load average for the last 15 minutes.

system.load.norm.1 type: scaled_float Load divided by the number of cores for the last minute.

system.load.norm.5 type: scaled_float Load divided by the number of cores for the last 5 minutes.

system.load.norm.15 type: scaled_float Load divided by the number of cores for the last 15 minutes.


Memory Fields


memory Fields memory contains local memory stats.


system.memory.total type: long & format: bytes. Total memory.

system.memory.used.bytes type: long & format: bytes. Used memory.

system.memory.free type: long & format: bytes. The total amount of free memory in bytes. This value does not include memory consumed by system caches and buffers (see system.memory.actual.free).

system.memory.used.pct type: scaled_float & format: percent. The percentage of used memory.

system.memory.actual.used.bytes type: long & format: bytes. Actual used memory in bytes. It represents the difference between the total and the available memory. The available memory depends on the OS. For more details, please check system.actual.free.

system.memory.actual.free type: long & format: bytes. Actual free memory in bytes. It is calculated based on the OS. On Linux it consists of the free memory plus caches and buffers. On OSX it is a sum of free memory and the inactive memory. On Windows, it is equal to system.memory.free.

system.memory.actual.used.pct type: scaled_float & format: percent. The percentage of actual used memory.

system.memory.swap.total type: long & format: bytes. Total swap memory.

system.memory.swap.used.bytes type: long & format: bytes. Used swap memory.

system.memory.swap.free type: long & format: bytes. Available swap memory.

system.memory.swap.used.pct type: scaled_float & format: percent. The percentage of used swap memory.


Network Fields


network Fields network contains network IO metrics for a single network interface.


system.network.name type: keyword The network interface name. example: eth0

system.network.out.bytes type: long & format: bytes. The number of bytes sent.

system.network.in.bytes type: long & format: bytes. The number of bytes received.

system.network.out.packets type: long The number of packets sent.

system.network.in.packets type: long The number of packets received.

system.network.in.errors type: long The number of errors while receiving.

system.network.out.errors type: long The number of errors while sending.

system.network.in.dropped type: long The number of incoming packets that were dropped.

system.network.out.dropped type: long The number of outgoing packets that were dropped. This value is always 0 on Darwin and BSD because it is not reported by the operating system.


Process Fields


process Fields process contains process metadata, CPU metrics, and memory metrics.


system.process.name type: keyword The process name.

system.process.state type: keyword The process state. For example: "running".

system.process.pid type: long The process pid.

system.process.ppid type: long The process parent pid.

system.process.pgid type: long The process group id.

system.process.cmdline type: keyword The full command-line used to start the process, including the arguments separated by space.

system.process.username type: keyword The username of the user that created the process. If the username cannot be determined, the field will contain the user's numeric identifier (UID). On Windows, this field includes the user's domain and is formatted as domain\username.

system.process.env type: dict The environment variables used to start the process. The data is available on FreeBSD, Linux, and OS X.

system.process.cpu.user type: long The amount of CPU time the process spent in user space.

system.process.cpu.total.pct type: scaled_float & format: percent. The percentage of CPU time spent by the process since the last update. Its value is similar to the %CPU value of the process displayed by the top command on Unix systems.

system.process.cpu.system type: long The amount of CPU time the process spent in kernel space.

system.process.cpu.total.ticks type: long The total CPU time spent by the process.

system.process.cpu.start_time type: date The time when the process was started.

system.process.memory.size type: long & format: bytes. The total virtual memory the process has.

system.process.memory.rss.bytes type: long & format: bytes. The Resident Set Size. The amount of memory the process occupied in main memory (RAM).

system.process.memory.rss.pct type: scaled_float & format: percent. The percentage of memory the process occupied in main memory (RAM).

system.process.memory.share type: long & format: bytes. The shared memory the process uses.

system.process.fd.open type: long The number of file descriptors open by the process.

system.process.fd.limit.soft type: long The soft limit on the number of file descriptors opened by the process. The soft limit can be changed by the process at any time.

system.process.fd.limit.hard type: long The hard limit on the number of file descriptors opened by the process. The hard limit can only be raised by root.

system.process.cgroup.id type: keyword The ID common to all cgroups associated with this task. If there isn't a common ID used by all cgroups this field will be absent.

system.process.cgroup.path type: keyword The path to the cgroup relative to the cgroup subsystem's mountpoint. If there isn't a common path used by all cgroups this field will be absent.

system.process.cgroup.cpu.id type: keyword ID of the cgroup.

system.process.cgroup.cpu.path type: keyword Path to the cgroup relative to the cgroup subsystem's mountpoint.

system.process.cgroup.cpu.cfs.period.us type: long Period of time in microseconds for how regularly a cgroup's access to CPU resources should be reallocated.

system.process.cgroup.cpu.cfs.quota.us type: long Total amount of time in microseconds for which all tasks in a cgroup can run during one period (as defined by cfs.period.us).

system.process.cgroup.cpu.cfs.shares type: long An integer value that specifies a relative share of CPU time available to the tasks in a cgroup. The value specified in the cpu.shares file must be 2 or higher.

system.process.cgroup.cpu.rt.period.us type: long Period of time in microseconds for how regularly a cgroup's access to CPU resources is reallocated.

system.process.cgroup.cpu.rt.runtime.us type: long Period of time in microseconds for the longest continuous period in which the tasks in a cgroup have access to CPU resources.

system.process.cgroup.cpu.stats.periods type: long Number of period intervals (as specified in cpu.cfs.period.us) that have elapsed.

system.process.cgroup.cpu.stats.throttled.periods type: long Number of times tasks in a cgroup have been throttled (that is, not allowed to run because they have exhausted all of the available time as specified by their quota).

system.process.cgroup.cpu.stats.throttled.ns type: long The total time duration (in nanoseconds) for which tasks in a cgroup have been throttled.

system.process.cgroup.cpuacct.id type: keyword ID of the cgroup.

system.process.cgroup.cpuacct.path type: keyword Path to the cgroup relative to the cgroup subsystem's mountpoint.

system.process.cgroup.cpuacct.total.ns type: long Total CPU time in nanoseconds consumed by all tasks in the cgroup.

system.process.cgroup.cpuacct.stats.user.ns type: long CPU time consumed by tasks in user mode.

system.process.cgroup.cpuacct.stats.system.ns type: long CPU time consumed by tasks in user (kernel) mode.

system.process.cgroup.cpuacct.percpu type: dict CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.

system.process.cgroup.memory.id type: keyword ID of the cgroup.

system.process.cgroup.memory.path type: keyword Path to the cgroup relative to the cgroup subsystem's mountpoint.

system.process.cgroup.memory.mem.usage.bytes type: long & format: bytes. Total memory usage by processes in the cgroup (in bytes).

system.process.cgroup.memory.mem.usage.max.bytes type: long & format: bytes. The maximum memory used by processes in the cgroup (in bytes).

system.process.cgroup.memory.mem.limit.bytes type: long & format: bytes. The maximum amount of user memory in bytes (including file cache) that tasks in the cgroup are allowed to use.

system.process.cgroup.memory.mem.failures type: long The number of times that the memory limit (mem.limit.bytes) was reached.

system.process.cgroup.memory.memsw.usage.bytes type: long & format: bytes. The sum of current memory usage plus swap space used by processes in the cgroup (in bytes).

system.process.cgroup.memory.memsw.usage.max.bytes type: long & format: bytes. The maximum amount of memory and swap space used by processes in the cgroup (in bytes).

system.process.cgroup.memory.memsw.limit.bytes type: long & format: bytes. The maximum amount for the sum of memory and swap usage that tasks in the cgroup are allowed to use.

system.process.cgroup.memory.memsw.failures type: long The number of times that the memory plus swap space limit (memsw.limit.bytes) was reached.

system.process.cgroup.memory.kmem.usage.bytes type: long & format: bytes. Total kernel memory usage by processes in the cgroup (in bytes).

system.process.cgroup.memory.kmem.usage.max.bytes type: long & format: bytes. The maximum kernel memory used by processes in the cgroup (in bytes).

system.process.cgroup.memory.kmem.limit.bytes type: long & format: bytes. The maximum amount of kernel memory that tasks in the cgroup are allowed to use.

system.process.cgroup.memory.kmem.failures type: long The number of times that the memory limit (kmem.limit.bytes) was reached.

system.process.cgroup.memory.kmem_tcp.usage.bytes type: long & format: bytes. Total memory usage for TCP buffers in bytes.

system.process.cgroup.memory.kmem_tcp.usage.max.bytes type: long & format: bytes. The maximum memory used for TCP buffers by processes in the cgroup (in bytes).

system.process.cgroup.memory.kmem_tcp.limit.bytes type: long & format: bytes. The maximum amount of memory for TCP buffers that tasks in the cgroup are allowed to use.

system.process.cgroup.memory.kmem_tcp.failures type: long The number of times that the memory limit (kmem_tcp.limit.bytes) was reached.

system.process.cgroup.memory.stats.active_anon.bytes type: long & format: bytes. Anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes.

system.process.cgroup.memory.stats.active_file.bytes type: long & format: bytes. File-backed memory on active LRU list, in bytes.

system.process.cgroup.memory.stats.cache.bytes type: long & format: bytes. Page cache, including tmpfs (shmem), in bytes.

system.process.cgroup.memory.stats.hierarchical_memory_limit.bytes type: long & format: bytes. Memory limit for the hierarchy that contains the memory cgroup, in bytes.

system.process.cgroup.memory.stats.hierarchical_memsw_limit.bytes type: long & format: bytes. Memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes.

system.process.cgroup.memory.stats.inactive_anon.bytes type: long & format: bytes. Anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes.

system.process.cgroup.memory.stats.inactive_file.bytes type: long & format: bytes. File-backed memory on inactive LRU list, in bytes.

system.process.cgroup.memory.stats.mapped_file.bytes type: long & format: bytes. Size of memory-mapped mapped files, including tmpfs (shmem), in bytes.

system.process.cgroup.memory.stats.page_faults type: long Number of times that a process in the cgroup triggered a page fault.

system.process.cgroup.memory.stats.major_page_faults type: long Number of times that a process in the cgroup triggered a major fault. "Major" faults happen when the kernel actually has to read the data from disk.

system.process.cgroup.memory.stats.pages_in type: long Number of pages paged into memory. This is a counter.

system.process.cgroup.memory.stats.pages_out type: long Number of pages paged out of memory. This is a counter.

system.process.cgroup.memory.stats.rss.bytes type: long & format: bytes. Anonymous and swap cache (includes transparent hugepages), not including tmpfs (shmem), in bytes.

system.process.cgroup.memory.stats.rss_huge.bytes type: long & format: bytes. Number of bytes of anonymous transparent hugepages.

system.process.cgroup.memory.stats.swap.bytes type: long & format: bytes. Swap usage, in bytes.

system.process.cgroup.memory.stats.unevictable.bytes type: long & format: bytes. Memory that cannot be reclaimed, in bytes.

system.process.cgroup.blkio.id type: keyword ID of the cgroup.

system.process.cgroup.blkio.path type: keyword Path to the cgroup relative to the cgroup subsystem's mountpoint.

system.process.cgroup.blkio.total.bytes type: long & format: bytes. Total number of bytes transferred to and from all block devices by processes in the cgroup.

system.process.cgroup.blkio.total.ios type: long Total number of I/O operations performed on all devices by processes in the cgroup as seen by the throttling policy.


Socket Fields


socket Fields TCP sockets that are active.


system.socket.direction type: keyword How the socket was initiated. Possible values are incoming, outgoing, or listening. example: incoming

system.socket.family type: keyword Address family. example: ipv4

system.socket.local.ip type: ip Local IP address. This can be an IPv4 or IPv6 address. example: 192.0.2.1 or 2001:0DB8:ABED:8536::1

system.socket.local.port type: long Local port. example: 22

system.socket.remote.ip type: ip Remote IP address. This can be an IPv4 or IPv6 address. example: 192.0.2.1 or 2001:0DB8:ABED:8536::1

system.socket.remote.port type: long Remote port. example: 22

system.socket.remote.host type: keyword PTR record associated with the remote IP. It is obtained via reverse IP lookup. example: 76-211-117-36.nw.example.com.

system.socket.remote.etld_plus_one type: keyword The effective top-level domain (eTLD) of the remote host plus one more label. For example, the eTLD+1 for "foo.bar.golang.org." is "golang.org.". The data for determining the eTLD comes from an embedded copy of the data from http://publicsuffix.org. example: example.com.

system.socket.remote.host_error type: keyword Error describing the cause of the reverse lookup failure.

system.socket.process.pid type: long ID of the process that opened the socket.

system.socket.process.command type: keyword Name of the command (limited to 20 chars by the OS).

system.socket.process.cmdline type: keyword

system.socket.process.exe type: keyword Absolute path to the executable.

system.socket.user.id type: long UID of the user running the process.

system.socket.user.name type: keyword Name of the user running the process.