Metrics in Apache Storm

By Technologies

Storm Tuples

The tuple metrics are published by all spouts.

Metric context: default, storm component

  • storm.tuple.fail [Meter] Tuples/second

    failed tuples rate

  • storm.tuple.ack [Meter] Tuples/second

    acked Tuples rate

  • storm.tuple.rtt [Histogram] milliseconds

    Tuple traversal time

  • storm.tuple.pending [Counter] Tuples

    pending Tuples

  • storm.tuple.length [Histogram] bytes

    average Tuple length

  • storm.tuple.eps [TimeValue] Tuples/second

    'max': the maximum events per second (eps), measured over 500 ms windows within a 30 s period (default values)
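
To make the storm.tuple.eps description more concrete, here is a minimal Python sketch of one plausible reading: events are counted in 500 ms windows over a 30 s period, and the busiest window is scaled to a per-second rate. The function name `max_eps` and its parameters are hypothetical, and this is not the platform's actual implementation.

```python
from collections import Counter


def max_eps(timestamps_ms, window_ms=500, period_ms=30_000):
    """Return the highest events-per-second rate seen in any window of the period.

    timestamps_ms: event times in milliseconds, relative to the period start.
    """
    counts = Counter()
    for t in timestamps_ms:
        # Ignore events that fall outside the observation period.
        if 0 <= t < period_ms:
            counts[t // window_ms] += 1
    if not counts:
        return 0.0
    # Scale the busiest window's count up to a per-second rate.
    return max(counts.values()) * 1000.0 / window_ms
```

With the default values, three events in the same 500 ms window give a maximum of 6 eps, even if the rest of the 30 s period is almost idle.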

Storm Latency

If latency is activated on a spout, all subsequent spouts and bolts publish this metric. Refer to the spouts common configuration to see how it works.

Metric context: default, storm component

  • storm.latency [Latency] milliseconds

    Computes the time elapsed from the initial spout generation (storm.latency.start) to the current spout or bolt. This difference is given in milliseconds in the field storm.latency.diff.
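
The computation described above can be sketched in a few lines of Python. This sketch assumes that storm.latency.start holds a Unix epoch timestamp in milliseconds, which the description does not state explicitly; the helper name `latency_diff_ms` and the sample values are hypothetical.

```python
import time


def latency_diff_ms(start_ms, now_ms=None):
    """Elapsed milliseconds between the initial spout generation and now."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - start_ms


# Hypothetical record carrying the start time set by the first spout.
record = {"storm.latency.start": 1_700_000_000_000}
# A later spout or bolt adds the difference when it handles the record.
record["storm.latency.diff"] = latency_diff_ms(
    record["storm.latency.start"], now_ms=1_700_000_000_250
)
```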

Storm Topology

These metrics are related to Storm "workers", which means that any running topology publishes them.

Metric context: default

  • storm.worker.uptime [Counter] second

    indicates whether the worker process is still alive

  • storm.worker.memory [Gauge] bytes

    the memory consumed by the worker topology

Netty commons

These metrics are published by all socket-related spouts that rely on the Netty library underneath (e.g. syslog_spout, lumberjack_spout, ...).

Metric context: default, storm component, netty

  • netty.app.recv [Counter] bytes

    decoded bytes received

  • netty.raw.recv [Counter] bytes

    raw bytes received

  • netty.app.sent [Counter] bytes

    decoded bytes sent

  • netty.raw.sent [Counter] bytes

    raw bytes sent

  • netty.compression.recv.ratio [Gauge] ratio

    compression ratio of received data

  • netty.compression.sent.ratio [Gauge] ratio

    compression ratio of sent data

Kafka Spout

The Kafka Spout publishes several metrics, some of them related to the various backlogs. Here is a quick view of the three published backlogs.

[Figure: the three published backlogs]

The so-called fetch backlog (backlog.fetch) tells you whether your consumer (i.e. the spout) is lagging behind your producer(s). The replayable backlog tells you how many messages you can potentially replay; it also has an important operational meaning.

The commit backlog is more informational: it gives you an idea of how many messages you will replay should you restart a topology.

Info

The metrics published by the spout are, of course, no longer published once you stop your topology. However, the punchplatform also publishes the lag of all known topics/partitions for all defined consumer groups from an external monitoring service, so that you never lose visibility on your backlogs.

Metrics context: default, storm component, kafka, kafka partition, kafka consumer

Metrics published:

  • kafka.spout.backlog.commit [Gauge] long

    the commit backlog expresses the number of messages that would be re-read in case of restart. It measures the gap between the latest saved committed offset and the latest offset. This metric is meaningful only with the "last_committed" strategy.

  • kafka.spout.backlog.fetch [Gauge] long

    the fetch backlog expresses the number of messages not yet read by the spout, i.e. how far the consumer is lagging behind the producer(s).

  • kafka.spout.backlog.replayable [Gauge] long

    the replayable backlog expresses the greatest number of messages that can possibly be re-read from this partition. This is an indication of the messages you can still replay from Kafka before they are definitively discarded.

  • kafka.spout.commit.latency [Timer] ms

    the time it takes to perform an offset commit for this partition. This gives an idea of the Kafka broker speed of handling commits.

  • kafka.spout.msg.rate [Meter] per partition read/sec

    this rate measures the number of effective reads from a partition

  • kafka.spout.msg.size [Histogram] size

    the average message size

  • kafka.spout.offset.ack.rate [Meter] acks/sec

    the rate of offset acknowledgement

  • kafka.spout.offset.fail.rate [Meter] acks/sec

    the rate of offset failure

  • kafka.spout.offset.earliest [Gauge] long

    the earliest offset for this partition

  • kafka.spout.offset.latest [Gauge] long

    the latest offset for this partition

  • kafka.spout.offset.committed [Gauge] long

    the committed offset for this partition

  • kafka.spout.time.current [Gauge] long

    the time associated to the currently read message. That value is a milliseconds epoch unix timestamp.

  • kafka.spout.time.delay [Gauge] long

    the time difference in milliseconds between now and time.current. Visualising this gauge gives you an easy view of how late in time your consumer is.

  • kafka.spout.lumberjack.compressed

    refer to Lumberjack Spout

  • kafka.spout.lumberjack.decoded

    refer to Lumberjack Spout

  • kafka.spout.lumberjack.uncompressed

    refer to Lumberjack Spout

Note

all these metrics are per topic, per partition and per consumer group.
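
As a rough illustration of how the three backlogs and the time delay relate to the offset gauges, here is a minimal Python sketch. The commit backlog follows the definition given above (latest offset minus committed offset). The replayable backlog is assumed to be latest minus earliest, and the fetch backlog is assumed to be latest minus the offset the spout has read so far. `PartitionOffsets`, `current`, `backlogs` and `time_delay_ms` are hypothetical names, not part of the punchplatform API.

```python
from dataclasses import dataclass


@dataclass
class PartitionOffsets:
    earliest: int   # kafka.spout.offset.earliest
    latest: int     # kafka.spout.offset.latest
    committed: int  # kafka.spout.offset.committed
    current: int    # offset last read by the spout (assumed, not a published metric)


def backlogs(o: PartitionOffsets) -> dict:
    """Derive the three backlogs of one partition from its offsets."""
    return {
        "commit": o.latest - o.committed,     # re-read on restart
        "fetch": o.latest - o.current,        # consumer lag (assumed definition)
        "replayable": o.latest - o.earliest,  # still available in Kafka (assumed definition)
    }


def time_delay_ms(now_ms: int, time_current_ms: int) -> int:
    """kafka.spout.time.delay: difference between now and time.current."""
    return now_ms - time_current_ms
```

Since the metrics are per topic, per partition and per consumer group, a computation like this would apply to each such combination separately.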

Syslog Spout

Metric context: default, storm component, netty

  • syslog.server.blocked_by_queue_full_ns [Meter] nanoseconds

    time elapsed in reception thread while waiting due to input queue full (may cause message loss if UDP)

File Spout

The file_spout only publishes the common metrics coming from topology and tuples.

Elasticsearch Spout

Metrics context: default, storm component, elasticsearch

  • elasticsearch.spout.ack.uptotimestamp [Meter] milliseconds epoch

    all documents/logs have been extracted, up to this instant

  • elasticsearch.spout.fullyacked.count [Gauge] number of documents

    number of logs/documents which have been successfully extracted and processed/acknowledged by the topology since the beginning of the extract job, and will not be replayed in case of topology failure. Processed logs/documents are counted only once, even if there has been earlier failure/retries during the job lifetime.

  • elasticsearch.spout.fetch.rate [Meter]

    number and rate of documents extracted from Elasticsearch

  • elasticsearch.spout.fetch.timeslice.startms [Meter] milliseconds epoch

    beginning of time slice for which we are currently extracting documents from Elasticsearch

Lumberjack Spout

Metric context: default, storm component, netty

  • netty.lumberjack.compressed [Counter] bytes

    compressed bytes count

  • netty.lumberjack.uncompressed [Counter] bytes

    uncompressed bytes

  • netty.lumberjack.decoded [Counter] bytes

    application bytes count

Http Spout

The http_spout only publishes the common metrics coming from topology, tuples and netty.

Kafka Bolt

Metric context: default, storm component, kafka

  • kafka.bolt.messages.bytes [Counter] bytes

    message bytes count

  • kafka.bolt.messages.batched [Histogram] messages

    number of messages per batch

  • kafka.bolt.messages.rate [Meter] message/second

    message rate

Syslog Bolt

The syslog_bolt only publishes the common metrics coming from topology, tuples and netty.

Lumberjack Bolt

Metric context: default, storm component, netty

  • netty.lumberjack.compressed [Counter] bytes

    compressed bytes count

  • netty.lumberjack.decoded [Counter] bytes

    application bytes count

  • netty.lumberjack.uncompressed [Counter] bytes

    uncompressed bytes

Archive Processor Bolt

Metric context: default, storm component, ceph, ceph partition

When "write_to_objects_storage" publication is activated, the Archive Processor Bolt publishes the following metrics:

  • ceph.cluster.kbytes.used [Gauge] instant kilobytes count

    storage space used by the cluster (including management data)

  • ceph.cluster.kbytes.free [Gauge] instant kilobytes count

    unused storage space available for the cluster

  • ceph.cluster.objects.stored [Gauge] instant count

    number of objects currently stored in the cluster

  • ceph.pool.kbytes.used [Gauge] instant kiloBytes count

    storage space used specifically by this object pool in the cluster

  • ceph.pool.objects.stored [Gauge] instant count

    number of objects currently stored in the object pool in the cluster

  • ceph.pool.objects.degraded [Gauge] instant count

    number of objects with missing replica

  • ceph.pool.objects.unfound [Gauge] instant count

    number of objects with unknown placement

  • ceph.pool.objects.missingonprimary [Gauge] instant count

    number of objects missing in primary

  • ceph.partition.objects.stored [Gauge] instant count

    number of objects currently stored in the partition of the topic

  • ceph.partition.tuples.stored [Gauge] instant count

    number of tuples currently stored in the partition of the topic

  • ceph.partition.bytes.stored [Gauge] instant bytes count

    number of bytes currently stored in the partition of the topic

  • ceph.partition.uncompressed.bytes.stored [Gauge] instant bytes count

    number of bytes stored in the partition of the topic (before compression)

  • ceph.partition.objects.written [Meter] number of objects

    number and rate of objects written in the topic

  • ceph.partition.tuples.written [Meter] number of tuples(documents or logs)

    number and rate of tuples written in the topic

  • ceph.partition.bytes.written [Meter] number of bytes

    number of bytes written in the partition of the topic (and rate)

  • ceph.partition.uncompressed.bytes.written [Meter] number of bytes

    number of bytes written in the partition of the topic and rate (before compression)

FileReader Bolt

Metric context: default, storm component

The FilesReaderBolt publishes the following metrics:

  • reader.files.read [Meter] integer

    files successfully extracted

  • reader.files.failure [Meter] integer

    files that were not (or not fully) extracted

  • reader.lines.read [Meter] integer

    lines successfully extracted

Elasticsearch Bolt

Metric context: default, storm component

  • storm.documents.indexation.rate [Meter] integer

    number of documents accumulated in bulk requests

  • storm.errors.indexation.rate [Meter] integer

    number of errors accumulated in bulk requests

Filter Bolt

Metric context: default, storm component

  • drop.rate [Meter] integer

    drop rate of filtered logs

  • storm.tuple.emit [Meter] tuples/second

    emitted tuples

  • storm.tuple.eps [TimeValue] tuples/second

    "max": the maximum events per second (eps), measured over 500 ms windows within a 30 s period (default values)

By Context

All the above metrics are enriched in the Elasticsearch backend with the following tag subfields, depending on the context level:

Default context

  • name (string)

    The metrics name identifier

  • type (string)

    Defines the technology the metric relates to. Here, its value is always set to "storm".

  • rep.host.name (string)

    The local hostname of the server running the storm component

  • platform.id (string)

    The logical identifier of the containing punchplatform. This is the same as the metrics root prefix used for the ES backend. It is used to differentiate metrics produced by multiple PunchPlatform clusters sharing the same metrics backend.

  • platform.tenant (string)

    The name or codename of the tenant, as configured in the channel and topology configuration files

  • platform.channel (string)

    The name of the logs channel, as configured in the channel and topology configuration files

  • platform.storm.container_id (string)

    The logical identifier of the containing storm cluster, as listed in the Punchplatform.properties file for topologies started in a cluster, or "local" for topologies started in local mode in a single process.

  • platform.storm.topology (string)

    The logical name of the topology, as it appears in the topology json configuration file. This is not the complete name used by Storm, which also includes a timestamp added at channel/topology initial start time and a unique instance identifier.

Storm Component Context

  • platform.storm.component.name (string)

    The logical name of the storm component, as it appears in the storm_settings.component field of the spout/bolt subsection of the topology json configuration file.

  • platform.storm.component.type (string)

    The spout/bolt type as stated in the "type" field of this component in the topology json configuration file

  • platform.storm.component.task_id (integer)

    The internal storm component number inside the topology. This is useful to distinguish between spout/bolt instances with the same component_id, which are executed when a storm_settings.executors value higher than 1 has been configured in this storm component subsection of the topology json configuration file.

Kafka Context

  • kafka.cluster (string)

    the kafka brokers cluster id as configured in punchplatform.properties

  • kafka.topic (string)

    the topic name as listed in the topology settings

Kafka Partition Context

Extends Kafka context

  • kafka.partition (integer)

    the partition number

Kafka Partition Consumer Context

Extends Kafka Partition Context

  • consumer.id (string)

    the kafka id of the consumer: storm topology id, name of storm component, task id
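
The way the Kafka contexts extend one another can be pictured as successive dictionary merges. The following Python sketch is only an illustration: the helper `extend`, the sample values and the layout of the consumer.id string are hypothetical, and the actual structure of the tags in the Elasticsearch documents may differ.

```python
def extend(parent, fields):
    """Return a new context holding the parent's tags plus the extra fields."""
    merged = dict(parent)  # copy, so the parent context is left untouched
    merged.update(fields)
    return merged


# Kafka context
kafka_ctx = {"kafka.cluster": "local", "kafka.topic": "logs"}

# Kafka Partition context: extends the Kafka context
partition_ctx = extend(kafka_ctx, {"kafka.partition": 0})

# Kafka Partition Consumer context: extends the Kafka Partition context
consumer_ctx = extend(partition_ctx, {"consumer.id": "my_topology.kafka_spout.1"})
```

A metric published in the Kafka Partition Consumer context therefore carries the cluster, topic and partition tags in addition to consumer.id.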

Elasticsearch Context

  • elasticsearch.cluster (string)

    the name of the elasticsearch cluster from which documents are extracted

  • elasticsearch.index (string)

    the name of the elasticsearch index from which documents are extracted

Ceph Context

  • ceph.pool (string)

    the name of the CEPH object pool

Ceph Partition Context

  • ceph.topic (string)

    the name of the topic

  • ceph.partition (integer)

    the partition id within the topic

Netty Context

  • netty.target.host (string)

    The hostname or address of the host to which data is sent.

  • netty.target.port (string)

    The udp or tcp target port to which data is sent.

  • netty.target.protocol (string)

    Used communication protocol.