PunchPlatform Supervision (by Nagios)¶
This chapter explains:
- the available metrics to monitor the system performance and capacity
- how to monitor a platform from an external supervision system, such as Nagios
Supervision¶
This part defines the resources to monitor in the supervision system.
Overview of resources¶
To ensure that the PunchPlatform system is working, supervision must at least target:
- The running status of a number of key services (from a Linux point of view)
- The system-level resource consumption of all PunchPlatform servers (disk space, CPU/RAM usage)
- The health status indicators published by the PunchPlatform admin service
Optionally, supervision should also target:
- Backlog levels from Elasticsearch (Admin)
- Supervisord error status
- Pacemaker error status
- Elasticsearch nodes count through REST API
Key Services to watch¶
On the following PunchPlatform servers, supervision must ensure that the supervisor service is running:

- Elasticsearch servers

    ```bash
    $ supervisorctl status elasticsearch | grep RUNNING
    ```

- Storm slaves of all clusters (LTR, LMR, STO)

    ```bash
    $ supervisorctl status storm-supervisor | grep RUNNING
    ```

- Storm masters of all clusters (usually on LMC and LTRs)

    ```bash
    $ supervisorctl status storm-nimbus | grep RUNNING
    $ supervisorctl status storm-ui | grep RUNNING
    ```

- Zookeeper servers (usually on KAF and LTRs)

    ```bash
    $ supervisorctl status zookeeper | grep RUNNING
    ```

- Kafka servers

    ```bash
    $ supervisorctl status kafka-<cluster_name> | grep RUNNING
    ```
Note
Take a look at punchplatform.properties or the architecture documents. Usual cluster names are "front" and "back".
- Kibana servers

    ```sh
    $ ! supervisorctl status | grep kibana | grep -v RUNNING
    ```

- Ceph servers

    ```sh
    $ ! sudo systemctl | grep 'ceph-osd-main@' | grep -v running
    $ ! sudo systemctl | grep 'ceph-osd-mon@' | grep -v running
    ```
On all cluster servers that use a virtual IP (LTR nodes, KIB nodes, LMC admin nodes, Grafana nodes), supervision must ensure that the pacemaker and corosync services are active:

```bash
$ sudo service corosync status
$ sudo service pacemaker status
```
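These checks are straightforward to wrap as Nagios/NRPE plugins. The following is a minimal sketch, assuming the supervisord service names used above (e.g. elasticsearch, storm-supervisor, kafka-front); adapt them to your punchplatform.properties:

```bash
#!/bin/bash
# Minimal Nagios/NRPE-style check for a supervisord-managed service (sketch).
# Usage: check_supervisord_service.sh <service-name>   (hypothetical script name)
SERVICE="${1:?usage: $0 <supervisord-service-name>}"

# supervisorctl prints lines such as: "<name> RUNNING pid 1234, uptime 1:23:45"
STATE=$(supervisorctl status "$SERVICE" 2>/dev/null | awk '{print $2}')

if [ "$STATE" = "RUNNING" ]; then
    echo "OK - $SERVICE is RUNNING"
    exit 0
else
    echo "CRITICAL - $SERVICE is ${STATE:-not found}"
    exit 2
fi
```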
PunchPlatform Health indicator API¶
To monitor the platform health using a dedicated tool (e.g. Nagios, Centreon, Zabbix, ...), the PunchPlatform exposes a JSON API: an Elasticsearch resource is kept updated with the latest platform health state.

This resource is located at `<es_url>/punchplatform/api/v2`. For example, using curl, you can fetch it with:

```bash
$ curl -s localhost:9200/punchplatform/api/v2
```
The returned document will look like this one:
```json
{
  "_index": "punchplatform",
  "_type": "api",
  "_id": "v2",
  "_version": 53,
  "found": true,
  "_source": {
    "@timestamp": "2018-05-23T14:04:02.584Z",
    "platform": {
      "health": "yellow"
    },
    "storm": {
      "health": "green",
      "clusters": {
        "main": {
          "health": "green",
          "cluster": {
            "health": "green",
            "details": {
              "slotsFree": 30,
              "memAssignedPercentUtil": "0.0",
              [...]
            }
          },
          "nimbus": {
            "health": "green",
            "details": {
              "status": "Leader",
              "port": 6627,
              [...]
            }
          },
          [...]
        }
      }
    },
    "elasticsearch": {
      "health": "yellow",
      "clusters": {
        "es_search": {
          "health": "yellow",
          "details": {
            "number_of_pending_tasks": 0,
            "cluster_name": "es_search",
            [...]
          }
        }
      }
    },
    "zookeeper": {
      "health": "green",
      "clusters": {
        "common": {
          "health": "green",
          "details": {
            "zk_packets_sent": "164687",
            "zk_max_latency": "92",
            [...]
          }
        }
      }
    },
    [...]
  }
}
```
At the top level, `@timestamp` is the last update time in ISO format. The other keys describe the PunchPlatform components declared in the punchplatform.properties file (Kafka, Storm, Elasticsearch, ...).

The "health" keys can take 3 different values:

- green: everything is OK
- yellow: non-nominal mode, a configuration problem is detected or some nodes are down but the service is still available
- red: critical failure, the service is down
This document represents the complete platform health. If you only need a subsection of it (let's say to only monitor Elasticsearch), feel free to parse it. For example, curl works pretty well with jq:
```bash
$ curl -s localhost:9200/punchplatform/api/v2 | jq -rc '._source.elasticsearch.health'
yellow

$ curl -s localhost:9200/punchplatform/api/v2 | jq -rc '._source.elasticsearch'
{
  "health": "yellow",
  "clusters": {
    "es_search": {
      "health": "yellow",
      "details": {
        "number_of_pending_tasks": 0,
        "cluster_name": "es_search",
        "active_shards": 5,
        "active_primary_shards": 5,
        "unassigned_shards": 1,
        "delayed_unassigned_shards": 0,
        "timed_out": false,
        "relocating_shards": 0,
        "initializing_shards": 0,
        "task_max_waiting_in_queue_millis": 0,
        "number_of_data_nodes": 1,
        "number_of_in_flight_fetch": 0,
        "active_shards_percent_as_number": 83.33333333333334,
        "status": "yellow",
        "number_of_nodes": 1
      }
    }
  }
}
```
Note
To learn more about jq, a lightweight and flexible command-line JSON processor, refer to the official documentation.
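As an illustration, here is a minimal sketch of a Nagios-style check built on this API, mapping the three health values to the standard Nagios exit codes (the Elasticsearch URL and the jq path are examples; point them at your own metrics backend, or at a subsection such as '._source.elasticsearch.health'):

```bash
#!/bin/bash
# Sketch: turn the PunchPlatform health API into a Nagios check.
ES_URL="${1:-localhost:9200}"

HEALTH=$(curl -s "$ES_URL/punchplatform/api/v2" | jq -r '._source.platform.health')

case "$HEALTH" in
    green)  echo "OK - platform health is green";            exit 0 ;;
    yellow) echo "WARNING - platform health is yellow";      exit 1 ;;
    red)    echo "CRITICAL - platform health is red";        exit 2 ;;
    *)      echo "UNKNOWN - could not read platform health"; exit 3 ;;
esac
```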
(Optional) Backlog Metrics Supervision¶
The "backlog" is the amount of messages that are stored in a Kafka topic, but have not yet be processed by the consuming layer(s) of PunchPlatform channels.
Because the PunchPlatform inbuilt log channels health monitoring system is based on monitoring latencies inside the log channels, the backlog raising is automatically monitored for channels that have autotest latency control paths configured to encompass the kafka buffering (i.e with at least one latency control path with a start point at a spout somewhere ABOVE the kafka layer, and an end point somewhere AFTER the kafka layer). This configuration SHOULD be done for each channel, in order to have automatic monitoring of unusual backlog (which might mean unsufficient configured processing capability for this channel, or incoming messages flood on this specific channel).
Nevertheless, it is additionnaly possible to externally configure a separate (not punchplatform-provided) alerting tool to raise alerts in case of high messages backlog : by using REST request to the elasticsearch REST API (using metrics backend Virtual IP), Supervision can retrieve mean values of the backlogs of any channels that includes a Kafka spout.
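For illustration only, the request below sketches such a backlog query. The index pattern (metrics-*), time field (ts) and metric/tag names follow the conventions described later in this chapter; the field holding the gauge value (here assumed to be gauge) and the tenant/channel values must be adapted to your platform:

```bash
# Average Kafka fetch backlog over the last 10 minutes for one channel (sketch).
# Replace localhost:9200 with the metrics backend virtual IP.
curl -s -H 'Content-Type: application/json' 'localhost:9200/metrics-*/_search' -d '{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term":  { "name": "backlog.fetch" } },
        { "term":  { "tags.pp.channel": "mychannel" } },
        { "range": { "ts": { "gte": "now-10m" } } }
      ]
    }
  },
  "aggs": { "mean_backlog": { "avg": { "field": "gauge" } } }
}'
```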
PunchPlatform Metrics¶
Metrics Introduction¶
Metrics categories¶
The PunchPlatform metrics system collects three families of metrics:
- The channel-level applicative metrics: computed by Storm components inside the topologies
- Java Virtual Machine metrics: computed by a generic plugin. The same metrics are reported by each worker and provide information about garbage collection, CPU usage, etc.
- Operating system-level metrics: collected by agents installed on the PunchPlatform application servers: Kafka, Storm, Elasticsearch, metrics backend and PunchPlatform admin servers.
Only the channel-level metrics can be activated/configured by the user in the channel configuration files.
Channel-level applicative metrics¶
All these metrics are storm-component-related.
Please refer to the spouts and bolts documentation for a detailed list of published metrics and associated names/context tags.
PunchPlatform Storm Java VM metrics¶
All PunchPlatform-specific topology components run inside Storm 'worker' Java virtual machines, or in a "local mode" JVM on dedicated servers.
The following metrics groups are published for each of these JVMs:
- memory (especially interesting is the ratio memory.total.used/memory.total.max)
- pools
- threadstates
- gc
These groups are published under the punchplatform.jvm.storm. metrics path, completed with the following identifiers:
- topology_id: the Storm topology id for JVMs running tasks belonging to a Storm topology (the topology id begins with the topology name)
- jvm_process_id: the Linux process id of the JVM
- jvm_host: the Linux hostname on which the JVM runs
Overview¶
The PunchPlatform comes with built-in application and system level monitoring. The collected metrics are stored in an Elasticsearch backend, on top of which you can design Grafana and Kibana dashboards.
With Elasticsearch, searching and retrieving metrics based on name patterns is not the most efficient approach. Instead, each metric has a (short) name, a value, and a number of tags: tenant, cluster, topology id, etc. Metrics can be requested using filters or exact values of some of these tags, which is both efficient and flexible.
With that design, the name of a metric identifies a category grouping all metrics of the same nature. Tags provide additional contextual information, typically characterizing the point of measure.
A typical Elasticsearch query is of the form "name:uptime AND tags.storm.component_id:punch_bolt AND tags.storm.topology_name:parsing AND tags.pp_platform_id:punchplatform AND tags.pp.channel:mychannel".
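For illustration, the same filter can be expressed as a query_string search against the metrics index through the Elasticsearch REST API (the index pattern, time field and tag values below are examples; see the note further down for the metrics template and time field, and adapt the values to your platform):

```bash
# Fetch the latest matching metric document (sketch).
curl -s -H 'Content-Type: application/json' 'localhost:9200/metrics-*/_search?size=1' -d '{
  "query": {
    "query_string": {
      "query": "name:uptime AND tags.storm.component_id:punch_bolt AND tags.storm.topology_name:parsing AND tags.pp_platform_id:punchplatform AND tags.pp.channel:mychannel"
    }
  },
  "sort": [ { "ts": "desc" } ]
}'
```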
Elasticsearch has been selected as the PunchPlatform metrics backend because:
- It is resilient and scalable
- It allows adding new tags to metric values over time, which eases dashboarding without impacting existing dashboards
- It reduces the number of COTS needed for a PunchPlatform deployment (in particular, it is already included in the standalone deployment of the PunchPlatform)
Naming convention for metrics context tags¶
All PunchPlatform metrics context tags are stored as subfields of the "tags" field of the metrics documents. When providing the tag subfield names in the PunchPlatform documentation, the `.` character identifies nested subfields.
PunchPlatform Metrics¶
Note
The Elasticsearch template for PunchPlatform metrics is "metrics-*" and uses the time field name "ts".
Storm Metrics¶
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
storm.tuple.fail | failed tuples rate | [Meter] Tuples/second | context_storm , context_channel |
storm.tuple.ack | acked Tuples rate | [Meter] Tuples/second | context_storm , context_channel |
storm.tuple.rtt | Tuple traversal time | [Histogram] milliseconds | context_storm , context_channel |
storm.tuple.pending | pending Tuples | [Counter] Tuples | context_storm , context_channel |
storm.tuple.length | average Tuple length | [Histogram] bytes | context_storm , context_channel |
storm.uptime | indicates that the process is still alive | [Counter] second | context_storm , context_channel |
storm.tuple.eps | 'max', eps max on 500ms on eps calculated during 30s (default values) | [TimeValue] Tuples/second | context_storm , context_channel |
Netty Metrics¶
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
 netty.app.recv | decoded bytes received | [Counter] bytes | context_netty |
 netty.raw.recv | raw bytes received | [Counter] bytes | context_netty |
 netty.app.sent | decoded bytes sent | [Counter] bytes | context_netty |
 netty.raw.sent | raw bytes sent | [Counter] bytes | context_netty |
 netty.compression.recv.ratio | compression ratio of received data | [Gauge] ratio | context_netty |
 netty.compression.sent.ratio | compression ratio of sent data | [Gauge] ratio | context_netty |
Kafka Spout Metrics¶
The Kafka Spout publishes several metrics, some of them related to the various backlogs. Here is a quick view of the three published backlogs.
The so-called fetch backlog (backlog.fetch) is the one that tells you if your consumer (i.e. the spout) is lagging behind your producer(s). The replayable backlog tells you how many messages you can potentially replay; it also has an important operational meaning.
The commit backlog is more informational: it gives you an idea of how many messages you would replay should you restart a topology.
Info
The metrics published by the spout stop being published (of course) whenever you stop your topology. However, the PunchPlatform also publishes the lag of all known topics/partitions for all defined consumer groups from an external monitoring service, so that you never lose visibility on your backlogs.
Here is the detailed list of metrics published by the KafkaSpout.
backlog.commit
: [Gauge] long - the commit backlog expresses the number of messages that would be re-read in case of restart. This measures the gap between the latest saved committed offset and the latest offset. This metric is meaningful only with the "last_committed" strategy.

backlog.fetch
: [Gauge] long - the backlog expressed as the number of messages not yet fetched by the spout. This measures the gap between the latest fetched offset and the latest available offset.

backlog.replayable
: [Gauge] long - the backlog expressed as the greatest number of messages that can possibly be re-read from this partition. This is an indication of the messages you can possibly replay from Kafka before they are definitively discarded.

commit.latency
: [Timer] ms - the time it takes to perform an offset commit for this partition. This gives an idea of how fast the Kafka broker handles commits.

msg.rate
: [Meter] reads/sec per partition - this rate measures the number of effective reads from a partition

msg.size
: [Histogram] size - the average message size

offset.ack.rate
: [Meter] acks/sec - the rate of offset acknowledgements

offset.fail.rate
: [Meter] fails/sec - the rate of offset failures

offset.earliest
: [Gauge] long - the earliest offset for this partition

offset.latest
: [Gauge] long - the latest offset for this partition

offset.committed
: [Gauge] long - the committed offset for this partition

time.current
: [Gauge] long - the time associated with the currently read message. That value is a milliseconds epoch Unix timestamp.

time.delay
: [Gauge] long - the time difference in milliseconds between now and time.current. Visualising this gauge gives you an easy view of how late in time your consumer is.
Note
All these metrics are per topic, per partition and per consumer group.
Syslog Spout Metrics {#metrics_syslog_spout}¶
The SyslogSpout publishes the following metrics:
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
 syslog.server.blocked_by_queue_full_ns | time elapsed in reception thread while waiting due to input queue full (may cause message loss if UDP) | [Meter] nanoseconds |  context_storm , context_channel |
File Spout Metrics¶
The FileSpout publishes the following metrics:
Elasticsearch Spout Metrics¶
The ElasticsearchSpout publishes the following metrics:
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
elasticsearch.spout.ack.uptotimestamp | all documents/logs have been extracted, up to this instant | [Meter] milliseconds since 01/01/1970 00:00 (UTC) | context_job , context_storm , context_elasticsearch |
elasticsearch.spout.fullyacked.count | number of logs/documents which have been successfully extracted and processed/acknowledged by the topology since the beginning of the extract job, and will not be replayed in case of topology failure. Processed logs/documents are counted only once, even if there has been earlier failure/retries during the job lifetime. | [Gauge] number of documents | context_job , context_storm , context_elasticsearch |
elasticsearch.spout.fetch.rate | number of documents extracted from elasticsearch | [Meter] count and rate | context_job , context_storm , context_elasticsearch |
elasticsearch.spout.fetch.timeslice.startms | beginning of the time slice for which we are currently extracting documents from Elasticsearch | [Meter] milliseconds since 01/01/1970 00:00 (UTC) | context_job , context_storm , context_elasticsearch |
Lumberjack Spout Metrics {#metrics_lumberjack_spout}¶
The LumberjackSpout publishes the following metrics, in addition to the standard Storm and Netty metrics (see the Storm Metrics and Netty Metrics sections above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
netty.lumberjack.compressed | compressed bytes count | [Counter] bytes | context_storm , context_channel , context_netty |
netty.lumberjack.uncompressed | uncompressed bytes count | [Counter] bytes | context_storm , context_channel , context_netty |
netty.lumberjack.decoded | decoded (application) bytes count | [Counter] bytes | context_storm , context_channel , context_netty |
Http Spout Metrics {#metrics_http_spout}¶
The HttpSpout publishes the standard Storm metrics (see the Storm Metrics section above).
Admin Metrics {#metrics_kafka_admin}¶
The Admin service publishes the following metrics:
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
autotest_latency | Latency of channel components | [Gauge] long | context_channel_autotestlatency |
kafka.earliest-available-offset | Earliest available Kafka offset. | [Gauge] long | context_platform , context_kafka_partition |
kafka.latest-available-offset | Latest available Kafka offset. | [Gauge] long | context_platform , context_kafka_partition |
kafka.earliest-available-timestamp | Kafka timestamp of the earliest message. | [Gauge] long | context_platform , context_kafka_partition |
kafka.latest-available-timestamp | Kafka timestamp of the latest message. | [Gauge] long | context_platform , context_kafka_partition |
kafka.consumer.lag.message | Difference between the latest offset and the current consumer offset. | [Gauge] long | context_platform , context_kafka_partition_consumer |
kafka.consumer.lag.time | Difference between the timestamp of the latest message and the timestamp of the current message at the consumer offset. | [Gauge] long | context_platform , context_kafka_partition_consumer |
Kafka Bolt Metrics {#metrics_kafka_bolt}¶
The KafkaBolt publishes the following metrics, in addition to the standard Storm metrics (see the Storm Metrics section above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
kafka.bolt.messages.bytes | average message size | [Counter] bytes | context_storm , context_channel , context_kafka |
kafka.bolt.messages.batched | average Tuple length | [Histogram] messages | context_storm , context_channel , context_kafka |
kafka.bolt.messages.rate | message rate | [Meter] messages/second | context_storm , context_channel , context_kafka |
Syslog Bolt Metrics {#metrics_syslog_bolt}¶
The SyslogBolt publishes the standard Storm and Netty metrics (see the Storm Metrics and Netty Metrics sections above).
Lumberjack Bolt Metrics {#metrics_lumberjack_bolt}¶
The LumberjackBolt publishes the following metrics, in addition to the standard Storm and Netty metrics (see the Storm Metrics and Netty Metrics sections above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
netty.lumberjack.compressed | compressed bytes count | [Counter] bytes | context_storm , context_channel , context_netty |
netty.lumberjack.decoded | application bytes count | [Counter] bytes | context_storm , context_channel , context_netty |
netty.lumberjack.uncompressed | uncompressed bytes | [Counter] bytes | context_storm , context_channel , context_netty |
Archive Processor Bolt Metrics¶
When "write_to_objects_storage" publication is activated, the Archive processor Bolt publishes the following metrics :
metrics_storm
NAME DESCRIPTION [TYPE] UNIT TAG CONTEXT
ceph.cluster.kbytes.used storage space [Gauge] instant context_storm
{.interpreted-text
used by the kilobytes count role="ref"},
cluster context_channel
{.interpreted-text
(including role="ref"},
management data) context_ceph
ceph.cluster.kbytes.free unused storage [Gauge] instant context_storm
{.interpreted-text
space available kilobytes count role="ref"},
for the cluster context_channel
{.interpreted-text
role="ref"},
context_ceph
ceph.cluster.objects.stored number of objects [Gauge] instant context_storm
{.interpreted-text
currently stored count role="ref"},
in the cluster context_channel
{.interpreted-text
role="ref"},
context_ceph
ceph.pool.kbytes.used storage space [Gauge] instant context_storm
{.interpreted-text
used specifically kiloBytes count role="ref"},
by this object context_channel
{.interpreted-text
pool in the role="ref"},
cluster context_ceph
ceph.pool.objects.stored number of objects [Gauge] instant context_storm
{.interpreted-text
currently stored count role="ref"},
in the object context_channel
{.interpreted-text
pool in the role="ref"},
cluster context_ceph
ceph.pool.objects.degraded number of objects [Gauge] instant context_storm
{.interpreted-text
with missing count role="ref"},
replica context_channel
{.interpreted-text
role="ref"},
context_ceph
ceph.pool.objects.unfound number of objects [Gauge] instant context_storm
{.interpreted-text
with unknown count role="ref"},
placement context_channel
{.interpreted-text
role="ref"},
context_ceph
ceph.pool.objects.missingonprimary number of objects [Gauge] instant context_storm
{.interpreted-text
missing in count role="ref"},
primary context_channel
{.interpreted-text
role="ref"},
context_ceph
ceph.partition.objects.stored number of objects [Gauge] instant context_storm
{.interpreted-text
currently stored count role="ref"},
in the partition context_channel
{.interpreted-text
of the topic role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.tuples.stored number of tuples [Gauge] instant context_storm
{.interpreted-text
currently stored count role="ref"},
in the partition context_channel
{.interpreted-text
of the topic role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.bytes.stored number of bytes [Gauge] instant context_storm
{.interpreted-text
currently stored bytes count role="ref"},
in the partition context_channel
{.interpreted-text
of the topic role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.uncompressed.bytes.stored number of bytes [Gauge] instant context_storm
{.interpreted-text
stored in the bytes count role="ref"},
partition of the context_channel
{.interpreted-text
topic (before role="ref"},
compression) context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.objects.written number and rate [Meter] number context_storm
{.interpreted-text
of objects of objects role="ref"},
written in the context_channel
{.interpreted-text
topic role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.tuples.written number and rate [Meter] number context_storm
{.interpreted-text
of tuples written of role="ref"},
in the topic tuples(documents context_channel
{.interpreted-text
or logs) role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.bytes.written number of bytes [Meter] number context_storm
{.interpreted-text
written in the of bytes role="ref"},
partition of the context_channel
{.interpreted-text
topic (and rate) role="ref"},
context_ceph_partition
{.interpreted-text
role="ref"}
ceph.partition.uncompressed.bytes.written number of bytes [Meter] number context_storm
{.interpreted-text
written in the of bytes role="ref"},
partition of the context_channel
{.interpreted-text
topic and rate role="ref"},
(before context_ceph_partition
{.interpreted-text
compression) role="ref"}
FileReader Bolt Metrics¶
The FilesReaderBolt publishes the following metrics, in addition to the standard Storm metrics (see the Storm Metrics section above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
reader.files.read | files successfully extracted | [Meter] integer | context_storm , context_channel |
reader.files.failure | files that were not (or not fully) extracted | [Meter] integer | context_storm , context_channel |
reader.lines.read | lines successfully extracted | [Meter] integer | context_storm , context_channel |
Elasticsearch Bolt Metrics¶
The ElasticsearchBolt publishes the following metrics, in addition to the standard Storm metrics (see the Storm Metrics section above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
storm.documents.indexation.rate | number of documents accumulated in bulk requests | [Meter] integer | context_storm , context_channel |
storm.errors.indexation.rate | number of errors accumulated in bulk requests | [Meter] integer | context_storm , context_channel |
Filter Bolt Metrics¶
The FilterBolt publishes the following metrics, in addition to the standard Storm metrics (see the Storm Metrics section above):
NAME | DESCRIPTION | [TYPE] UNIT | TAG CONTEXT |
---|---|---|---|
drop.rate | drop rate of filtered logs | [Meter] integer | context_storm , context_channel |
storm.tuple.emit | emitted tuples | [Meter] tuples/second | context_storm , context_channel |
storm.tuple.eps | 'max', eps max on 500ms on eps calculated during 30s (default values) | [TimeValue] tuples/second | context_storm , context_channel |
Context¶
All the above metrics are enriched in the Elasticsearch backend with the following tags subfields, depending on the context level:
Kafka Context¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.kafka.cluster | the kafka brokers cluster id as configured in punchplatform.properties | string |
tags.kafka.topic | the topic name as listed in the topology settings | string |
Kafka Partition Context {#context_kafka_partition}¶
Extends context_kafka
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.kafka.partition | the partition number | number |
Kafka Partition Consumer Context {#context_kafka_partition_consumer}¶
Extends context_kafka_partition
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.consumer.id | the kafka id of the consumer: storm topology id, name of storm component, task id | string |
Platform Context {#context_platform}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.pp.platform_id | The logical identifier of the containing punchplatform. This is the same as the metrics root prefix used for the ES back end. It is used to differentiate metrics produced by multiple PunchPlatform clusters sharing a same metrics backend. | string |
Channel Context {#context_channel}¶
Extends context_platform
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.pp.tenant | The name or codename of the tenant, as configured in the channel and topology configuration files | string |
tags.pp.channel | The name of the logs channel, as configured in the channel and topology configuration files | string |
Channel Autotest Latency Context {#context_channel_autotestlatency}¶
Extends context_channel
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.autotest_latency_path.start | Injection node name: \<punchplatform>.\<tenant>.\<channel>.\<cluster>.\<topology>.\<component> | integer |
tags.autotest_latency_path.end | Current node name: \<punchplatform>.\<tenant>.\<channel>.\<cluster>.\<topology>.\<component> | integer |
Job Context {#context_job}¶
Extends context_channel
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.pp.job_id | the unique identifier of the PunchPlatform job associated to the elasticsearch extractor topology | string |
Storm Context {#context_storm}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.storm.container_id | The logical identifier of the containing storm cluster, as listed in the punchplatform.properties file for topologies started in a cluster, or "local\<hostname>" for topologies started in local mode in a single process. | string |
tags.storm.topology_name | The logical name of the topology, as it appears in the topology json configuration file. This is not the complete name used by Storm, which includes a timestamp added at channel/topology initial start time and a unique instance identifier. | string |
tags.storm.component_id | The logical name of the storm component, as it appears in the storm_settings.component field of the spout/bolt subsection of the topology json configuration file. | string |
tags.storm.component_type | The spout/bolt type as stated in the "type" field of this component in the topology json configuration file | string |
tags.storm.task_id | The internal storm component number inside the topology. This is useful to distinguish between spout/bolt instances with the same component_id, which are executed when a storm_settings.executors higher than 1 has been configured in this storm component subsection of the topology json configuration file | integer |
tags.hostname | The local hostname of the server running the storm component | string |
Elasticsearch Context {#context_elasticsearch}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.elasticsearch.cluster | the name of the elasticsearch cluster from which documents are extracted | string |
tags.elasticsearch.index | the name of the elasticsearch index from which documents are extracted | string |
Ceph Context {#context_ceph}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.ceph.pool | the name of the CEPH object pool | string |
Ceph Partition Context {#context_ceph_partition}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.ceph.topic | the name of the topic | string |
tags.ceph.partition | the partition id within the topic | integer |
Netty Context {#context_netty}¶
TAGS SUB FIELD | DESCRIPTION | UNIT |
---|---|---|
tags.netty.target.host | The hostname or address of the host to which data is sent. | string |
tags.netty.target.port | The udp or tcp target port to which data is sent. | string |
tags.netty.target.protocol | Used communication protocol. | string |
MetricBeat¶
- module: system
- module: kafka
- module: zookeeper
See the Metricbeat documentation for further information.
Note
The Elasticsearch template for Metricbeat is "metricbeat-*" and uses the time field name "@timestamp".
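As a quick sanity check (a sketch, assuming the metrics Elasticsearch backend answers on localhost:9200), you can verify that Metricbeat documents have been indexed during the last few minutes:

```bash
# Count Metricbeat documents indexed over the last 5 minutes.
curl -s -H 'Content-Type: application/json' 'localhost:9200/metricbeat-*/_count' -d '{
  "query": { "range": { "@timestamp": { "gte": "now-5m" } } }
}'
```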
System module¶
Core Fields¶
The core fields contain local CPU core stats.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.core.id | long | CPU Core number. |
system.core.user.pct | scaled_float (format: percent) | The percentage of CPU time spent in user space. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, then the cpu.user_p will be 180%. |
system.core.user.ticks | long | The amount of CPU time spent in user space. |
system.core.system.pct | scaled_float (format: percent) | The percentage of CPU time spent in kernel space. |
system.core.system.ticks | long | The amount of CPU time spent in kernel space. |
system.core.nice.pct | scaled_float (format: percent) | The percentage of CPU time spent on low-priority processes. |
system.core.nice.ticks | long | The amount of CPU time spent on low-priority processes. |
system.core.idle.pct | scaled_float (format: percent) | The percentage of CPU time spent idle. |
system.core.idle.ticks | long | The amount of CPU time spent idle. |
system.core.iowait.pct | scaled_float (format: percent) | The percentage of CPU time spent in wait (on disk). |
system.core.iowait.ticks | long | The amount of CPU time spent in wait (on disk). |
system.core.irq.pct | scaled_float (format: percent) | The percentage of CPU time spent servicing and handling hardware interrupts. |
system.core.irq.ticks | long | The amount of CPU time spent servicing and handling hardware interrupts. |
system.core.softirq.pct | scaled_float (format: percent) | The percentage of CPU time spent servicing and handling software interrupts. |
system.core.softirq.ticks | long | The amount of CPU time spent servicing and handling software interrupts. |
system.core.steal.pct | scaled_float (format: percent) | The percentage of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix. |
system.core.steal.ticks | long | The amount of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix. |
Cpu Fields¶
The cpu fields contain local CPU stats.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.cpu.cores | long | The number of CPU cores. The CPU percentages can range from [0, 100% * cores]. |
system.cpu.user.pct | scaled_float (format: percent) | The percentage of CPU time spent in user space. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, then the system.cpu.user.pct will be 180%. |
system.cpu.system.pct | scaled_float (format: percent) | The percentage of CPU time spent in kernel space. |
system.cpu.nice.pct | scaled_float (format: percent) | The percentage of CPU time spent on low-priority processes. |
system.cpu.idle.pct | scaled_float (format: percent) | The percentage of CPU time spent idle. |
system.cpu.iowait.pct | scaled_float (format: percent) | The percentage of CPU time spent in wait (on disk). |
system.cpu.irq.pct | scaled_float (format: percent) | The percentage of CPU time spent servicing and handling hardware interrupts. |
system.cpu.softirq.pct | scaled_float (format: percent) | The percentage of CPU time spent servicing and handling software interrupts. |
system.cpu.steal.pct | scaled_float (format: percent) | The percentage of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix. |
system.cpu.user.ticks | long | The amount of CPU time spent in user space. |
system.cpu.system.ticks | long | The amount of CPU time spent in kernel space. |
system.cpu.nice.ticks | long | The amount of CPU time spent on low-priority processes. |
system.cpu.idle.ticks | long | The amount of CPU time spent idle. |
system.cpu.iowait.ticks | long | The amount of CPU time spent in wait (on disk). |
system.cpu.irq.ticks | long | The amount of CPU time spent servicing and handling hardware interrupts. |
system.cpu.softirq.ticks | long | The amount of CPU time spent servicing and handling software interrupts. |
system.cpu.steal.ticks | long | The amount of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix. |
Diskio Fields¶
The diskio fields contain disk IO metrics collected from the operating system.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.diskio.name | keyword | The disk name. Example: sda1 |
system.diskio.serial_number | keyword | The disk's serial number. This may not be provided by all operating systems. |
system.diskio.read.count | long | The total number of reads completed successfully. |
system.diskio.write.count | long | The total number of writes completed successfully. |
system.diskio.read.bytes | long (format: bytes) | The total number of bytes read successfully. On Linux this is the number of sectors read multiplied by an assumed sector size of 512. |
system.diskio.write.bytes | long (format: bytes) | The total number of bytes written successfully. On Linux this is the number of sectors written multiplied by an assumed sector size of 512. |
system.diskio.read.time | long | The total number of milliseconds spent by all reads. |
system.diskio.write.time | long | The total number of milliseconds spent by all writes. |
system.diskio.io.time | long | The total number of milliseconds spent doing I/Os. |
FileSystem Fields¶
The filesystem fields contain local filesystem stats.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.filesystem.available | long (format: bytes) | The disk space available to an unprivileged user in bytes. |
system.filesystem.device_name | keyword | The disk name. For example: /dev/disk1 |
system.filesystem.mount_point | keyword | The mounting point. For example: / |
system.filesystem.files | long | The total number of file nodes in the file system. |
system.filesystem.free | long (format: bytes) | The disk space available in bytes. |
system.filesystem.free_files | long | The number of free file nodes in the file system. |
system.filesystem.total | long (format: bytes) | The total disk space in bytes. |
system.filesystem.used.bytes | long (format: bytes) | The used disk space in bytes. |
system.filesystem.used.pct | scaled_float (format: percent) | The percentage of used disk space. |
Fsstat Fields¶
The system.fsstat fields contain filesystem metrics aggregated from all mounted filesystems, similar to what df -a prints out.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.fsstat.count | long | Number of file systems found. |
system.fsstat.total_files | long | Total number of files. |
system.fsstat.total_size.free | long (format: bytes) | Total free space. |
system.fsstat.total_size.used | long (format: bytes) | Total used space. |
system.fsstat.total_size.total | long (format: bytes) | Total space (used plus free). |
Load Fields¶
The load fields contain load averages.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.load.1 | scaled_float | Load average for the last minute. |
system.load.5 | scaled_float | Load average for the last 5 minutes. |
system.load.15 | scaled_float | Load average for the last 15 minutes. |
system.load.norm.1 | scaled_float | Load divided by the number of cores for the last minute. |
system.load.norm.5 | scaled_float | Load divided by the number of cores for the last 5 minutes. |
system.load.norm.15 | scaled_float | Load divided by the number of cores for the last 15 minutes. |
Memory Fields¶
The memory fields contain local memory stats.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.memory.total | long (format: bytes) | Total memory. |
system.memory.used.bytes | long (format: bytes) | Used memory. |
system.memory.free | long (format: bytes) | The total amount of free memory in bytes. This value does not include memory consumed by system caches and buffers (see system.memory.actual.free). |
system.memory.used.pct | scaled_float (format: percent) | The percentage of used memory. |
system.memory.actual.used.bytes | long (format: bytes) | Actual used memory in bytes. It represents the difference between the total and the available memory. The available memory depends on the OS. For more details, please check system.actual.free. |
system.memory.actual.free | long (format: bytes) | Actual free memory in bytes. It is calculated based on the OS. On Linux it consists of the free memory plus caches and buffers. On OSX it is a sum of free memory and the inactive memory. On Windows, it is equal to system.memory.free. |
system.memory.actual.used.pct | scaled_float (format: percent) | The percentage of actual used memory. |
system.memory.swap.total | long (format: bytes) | Total swap memory. |
system.memory.swap.used.bytes | long (format: bytes) | Used swap memory. |
system.memory.swap.free | long (format: bytes) | Available swap memory. |
system.memory.swap.used.pct | scaled_float (format: percent) | The percentage of used swap memory. |
Network Fields¶
The network fields contain network IO metrics for a single network interface.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.network.name | keyword | The network interface name. Example: eth0 |
system.network.out.bytes | long (format: bytes) | The number of bytes sent. |
system.network.in.bytes | long (format: bytes) | The number of bytes received. |
system.network.out.packets | long | The number of packets sent. |
system.network.in.packets | long | The number of packets received. |
system.network.in.errors | long | The number of errors while receiving. |
system.network.out.errors | long | The number of errors while sending. |
system.network.in.dropped | long | The number of incoming packets that were dropped. |
system.network.out.dropped | long | The number of outgoing packets that were dropped. This value is always 0 on Darwin and BSD because it is not reported by the operating system. |
Process Fields¶
The process fields contain process metadata, CPU metrics, and memory metrics.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.process.name | keyword | The process name. |
system.process.state | keyword | The process state. For example: "running". |
system.process.pid | long | The process pid. |
system.process.ppid | long | The process parent pid. |
system.process.pgid | long | The process group id. |
system.process.cmdline | keyword | The full command-line used to start the process, including the arguments separated by space. |
system.process.username | keyword | The username of the user that created the process. If the username cannot be determined, the field will contain the user's numeric identifier (UID). On Windows, this field includes the user's domain and is formatted as domain\username. |
system.process.env | dict | The environment variables used to start the process. The data is available on FreeBSD, Linux, and OS X. |
system.process.cpu.user | long | The amount of CPU time the process spent in user space. |
system.process.cpu.total.pct | scaled_float (format: percent) | The percentage of CPU time spent by the process since the last update. Its value is similar to the %CPU value of the process displayed by the top command on Unix systems. |
system.process.cpu.system | long | The amount of CPU time the process spent in kernel space. |
system.process.cpu.total.ticks | long | The total CPU time spent by the process. |
system.process.cpu.start_time | date | The time when the process was started. |
system.process.memory.size | long (format: bytes) | The total virtual memory the process has. |
system.process.memory.rss.bytes | long (format: bytes) | The Resident Set Size. The amount of memory the process occupied in main memory (RAM). |
system.process.memory.rss.pct | scaled_float (format: percent) | The percentage of memory the process occupied in main memory (RAM). |
system.process.memory.share | long (format: bytes) | The shared memory the process uses. |
system.process.fd.open | long | The number of file descriptors open by the process. |
system.process.fd.limit.soft | long | The soft limit on the number of file descriptors opened by the process. The soft limit can be changed by the process at any time. |
system.process.fd.limit.hard | long | The hard limit on the number of file descriptors opened by the process. The hard limit can only be raised by root. |
system.process.cgroup.id | keyword | The ID common to all cgroups associated with this task. If there isn't a common ID used by all cgroups this field will be absent. |
system.process.cgroup.path | keyword | The path to the cgroup relative to the cgroup subsystem's mountpoint. If there isn't a common path used by all cgroups this field will be absent. |
system.process.cgroup.cpu.id | keyword | ID of the cgroup. |
system.process.cgroup.cpu.path | keyword | Path to the cgroup relative to the cgroup subsystem's mountpoint. |
system.process.cgroup.cpu.cfs.period.us | long | Period of time in microseconds for how regularly a cgroup's access to CPU resources should be reallocated. |
system.process.cgroup.cpu.cfs.quota.us | long | Total amount of time in microseconds for which all tasks in a cgroup can run during one period (as defined by cfs.period.us). |
system.process.cgroup.cpu.cfs.shares | long | An integer value that specifies a relative share of CPU time available to the tasks in a cgroup. The value specified in the cpu.shares file must be 2 or higher. |
system.process.cgroup.cpu.rt.period.us | long | Period of time in microseconds for how regularly a cgroup's access to CPU resources is reallocated. |
system.process.cgroup.cpu.rt.runtime.us | long | Period of time in microseconds for the longest continuous period in which the tasks in a cgroup have access to CPU resources. |
system.process.cgroup.cpu.stats.periods | long | Number of period intervals (as specified in cpu.cfs.period.us) that have elapsed. |
system.process.cgroup.cpu.stats.throttled.periods | long | Number of times tasks in a cgroup have been throttled (that is, not allowed to run because they have exhausted all of the available time as specified by their quota). |
system.process.cgroup.cpu.stats.throttled.ns | long | The total time duration (in nanoseconds) for which tasks in a cgroup have been throttled. |
system.process.cgroup.cpuacct.id | keyword | ID of the cgroup. |
system.process.cgroup.cpuacct.path | keyword | Path to the cgroup relative to the cgroup subsystem's mountpoint. |
system.process.cgroup.cpuacct.total.ns | long | Total CPU time in nanoseconds consumed by all tasks in the cgroup. |
system.process.cgroup.cpuacct.stats.user.ns | long | CPU time consumed by tasks in user mode. |
system.process.cgroup.cpuacct.stats.system.ns | long | CPU time consumed by tasks in user (kernel) mode. |
system.process.cgroup.cpuacct.percpu | dict | CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup. |
system.process.cgroup.memory.id | keyword | ID of the cgroup. |
system.process.cgroup.memory.path | keyword | Path to the cgroup relative to the cgroup subsystem's mountpoint. |
system.process.cgroup.memory.mem.usage.bytes | long (format: bytes) | Total memory usage by processes in the cgroup (in bytes). |
system.process.cgroup.memory.mem.usage.max.bytes | long (format: bytes) | The maximum memory used by processes in the cgroup (in bytes). |
system.process.cgroup.memory.mem.limit.bytes | long (format: bytes) | The maximum amount of user memory in bytes (including file cache) that tasks in the cgroup are allowed to use. |
system.process.cgroup.memory.mem.failures | long | The number of times that the memory limit (mem.limit.bytes) was reached. |
system.process.cgroup.memory.memsw.usage.bytes | long (format: bytes) | The sum of current memory usage plus swap space used by processes in the cgroup (in bytes). |
system.process.cgroup.memory.memsw.usage.max.bytes | long (format: bytes) | The maximum amount of memory and swap space used by processes in the cgroup (in bytes). |
system.process.cgroup.memory.memsw.limit.bytes | long (format: bytes) | The maximum amount for the sum of memory and swap usage that tasks in the cgroup are allowed to use. |
system.process.cgroup.memory.memsw.failures | long | The number of times that the memory plus swap space limit (memsw.limit.bytes) was reached. |
system.process.cgroup.memory.kmem.usage.bytes | long (format: bytes) | Total kernel memory usage by processes in the cgroup (in bytes). |
system.process.cgroup.memory.kmem.usage.max.bytes | long (format: bytes) | The maximum kernel memory used by processes in the cgroup (in bytes). |
system.process.cgroup.memory.kmem.limit.bytes | long (format: bytes) | The maximum amount of kernel memory that tasks in the cgroup are allowed to use. |
system.process.cgroup.memory.kmem.failures | long | The number of times that the memory limit (kmem.limit.bytes) was reached. |
system.process.cgroup.memory.kmem_tcp.usage.bytes | long (format: bytes) | Total memory usage for TCP buffers in bytes. |
system.process.cgroup.memory.kmem_tcp.usage.max.bytes | long (format: bytes) | The maximum memory used for TCP buffers by processes in the cgroup (in bytes). |
system.process.cgroup.memory.kmem_tcp.limit.bytes | long (format: bytes) | The maximum amount of memory for TCP buffers that tasks in the cgroup are allowed to use. |
system.process.cgroup.memory.kmem_tcp.failures | long | The number of times that the memory limit (kmem_tcp.limit.bytes) was reached. |
system.process.cgroup.memory.stats.active_anon.bytes | long (format: bytes) | Anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes. |
system.process.cgroup.memory.stats.active_file.bytes | long (format: bytes) | File-backed memory on active LRU list, in bytes. |
system.process.cgroup.memory.stats.cache.bytes | long (format: bytes) | Page cache, including tmpfs (shmem), in bytes. |
system.process.cgroup.memory.stats.hierarchical_memory_limit.bytes | long (format: bytes) | Memory limit for the hierarchy that contains the memory cgroup, in bytes. |
system.process.cgroup.memory.stats.hierarchical_memsw_limit.bytes | long (format: bytes) | Memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes. |
system.process.cgroup.memory.stats.inactive_anon.bytes | long (format: bytes) | Anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes. |
system.process.cgroup.memory.stats.inactive_file.bytes | long (format: bytes) | File-backed memory on inactive LRU list, in bytes. |
system.process.cgroup.memory.stats.mapped_file.bytes | long (format: bytes) | Size of memory-mapped mapped files, including tmpfs (shmem), in bytes. |
system.process.cgroup.memory.stats.page_faults | long | Number of times that a process in the cgroup triggered a page fault. |
system.process.cgroup.memory.stats.major_page_faults | long | Number of times that a process in the cgroup triggered a major fault. "Major" faults happen when the kernel actually has to read the data from disk. |
system.process.cgroup.memory.stats.pages_in | long | Number of pages paged into memory. This is a counter. |
system.process.cgroup.memory.stats.pages_out | long | Number of pages paged out of memory. This is a counter. |
system.process.cgroup.memory.stats.rss.bytes | long (format: bytes) | Anonymous and swap cache (includes transparent hugepages), not including tmpfs (shmem), in bytes. |
system.process.cgroup.memory.stats.rss_huge.bytes | long (format: bytes) | Number of bytes of anonymous transparent hugepages. |
system.process.cgroup.memory.stats.swap.bytes | long (format: bytes) | Swap usage, in bytes. |
system.process.cgroup.memory.stats.unevictable.bytes | long (format: bytes) | Memory that cannot be reclaimed, in bytes. |
system.process.cgroup.blkio.id | keyword | ID of the cgroup. |
system.process.cgroup.blkio.path | keyword | Path to the cgroup relative to the cgroup subsystem's mountpoint. |
system.process.cgroup.blkio.total.bytes | long (format: bytes) | Total number of bytes transferred to and from all block devices by processes in the cgroup. |
system.process.cgroup.blkio.total.ios | long | Total number of I/O operations performed on all devices by processes in the cgroup as seen by the throttling policy. |
Socket Fields¶
The socket fields describe TCP sockets that are active.
FIELD | TYPE | DESCRIPTION |
---|---|---|
system.socket.direction | keyword | How the socket was initiated. Possible values are incoming, outgoing, or listening. Example: incoming |
system.socket.family | keyword | Address family. Example: ipv4 |
system.socket.local.ip | ip | Local IP address. This can be an IPv4 or IPv6 address. Example: 192.0.2.1 or 2001:0DB8:ABED:8536::1 |
system.socket.local.port | long | Local port. Example: 22 |
system.socket.remote.ip | ip | Remote IP address. This can be an IPv4 or IPv6 address. Example: 192.0.2.1 or 2001:0DB8:ABED:8536::1 |
system.socket.remote.port | long | Remote port. Example: 22 |
system.socket.remote.host | keyword | PTR record associated with the remote IP. It is obtained via reverse IP lookup. Example: 76-211-117-36.nw.example.com. |
system.socket.remote.etld_plus_one | keyword | The effective top-level domain (eTLD) of the remote host plus one more label. For example, the eTLD+1 for "foo.bar.golang.org." is "golang.org.". The data for determining the eTLD comes from an embedded copy of the data from http://publicsuffix.org. Example: example.com. |
system.socket.remote.host_error | keyword | Error describing the cause of the reverse lookup failure. |
system.socket.process.pid | long | ID of the process that opened the socket. |
system.socket.process.command | keyword | Name of the command (limited to 20 chars by the OS). |
system.socket.process.cmdline | keyword | |
system.socket.process.exe | keyword | Absolute path to the executable. |
system.socket.user.id | long | UID of the user running the process. |
system.socket.user.name | keyword | Name of the user running the process. |