
Metrics Overview

Metrics Introduction

Metrics categories

The PunchPlatform metrics system collects multiple families of metrics:

  • Channel-level applicative metrics: computed by Storm or Spark components.
  • Java Virtual Machine metrics: computed by a generic plugin. The same metrics are reported by each worker and provide information about garbage collection, CPU usage, etc.
  • Platform component-level metrics: collected by a dedicated service that checks the health of platform services (such as Kafka, Storm, Elasticsearch, ...).
  • Operating system-level metrics: collected by agents (such as Metricbeat, Packetbeat, ...) installed on PunchPlatform application servers (CPU, disk usage, network usage, ...).

Only the channel-level metrics can be activated and configured by the user, in the channel configuration files.

Channel-level applicative metrics

Most of these metrics are related to Storm(-like) application nodes.

For Spark/PySpark applications (aggregations, Spark/PySpark batch processing), you can leverage the Spark node metrics.

Please refer to each node's documentation for a detailed list of the specific metrics it publishes and the associated names and context tags.

PunchPlatform Storm Java VM metrics

Storm punchlines run inside Storm worker Java virtual machines, or in a lightweight punch JVM if you use the Storm-compatible light punch engine. In either case, the following metric groups are published for each of these JVMs:

  • memory (the ratio memory.total.used/memory.total.max is especially interesting)
  • pools
  • threadstates
  • gc

These groups are published under the following metrics path:

punchplatform.jvm.storm.<topology_id>.<jvm_process_id>@<jvm_host>

Where:

  • topology_id is the Storm topology id, for JVMs running tasks belonging to a Storm topology. The topology id begins with <tenant>_<channel>_<name_of_topo_in_channel>_
  • jvm_process_id is the linux process id of the JVM
  • jvm_host is the linux hostname on which the JVM runs
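
As an illustration, the following Python snippet (a minimal sketch; the sample topology id, process id and host are made up, only the path layout comes from the convention above) splits such a metrics path back into its components:

import re

def parse_jvm_metrics_path(path):
    """Split a 'punchplatform.jvm.storm.<topology_id>.<jvm_process_id>@<jvm_host>'
    metrics path into its components, following the naming convention above."""
    prefix = "punchplatform.jvm.storm."
    if not path.startswith(prefix):
        raise ValueError("not a PunchPlatform Storm JVM metrics path: " + path)
    # The remainder is '<topology_id>.<jvm_process_id>@<jvm_host>'
    remainder, jvm_host = path[len(prefix):].rsplit("@", 1)
    topology_id, jvm_process_id = remainder.rsplit(".", 1)
    return {
        "topology_id": topology_id,             # begins with <tenant>_<channel>_<name_of_topo_in_channel>_
        "jvm_process_id": int(jvm_process_id),  # linux process id of the JVM
        "jvm_host": jvm_host,                   # linux hostname on which the JVM runs
    }

# Illustrative values: tenant 'mytenant', channel 'apache', topology 'parsing'
print(parse_jvm_metrics_path("punchplatform.jvm.storm.mytenant_apache_parsing_1.12345@node01"))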

Overview

The PunchPlatform comes with built-in application- and system-level monitoring. The collected metrics are stored in an Elasticsearch backend, on top of which you can design Kibana dashboards.

With Elasticsearch, searching and retrieving metrics based on name patterns is not the most efficient approach. Instead, each metric has a (short) name, a value, and a number of tags: tenant, cluster, topology id, etc. Metrics can be requested using filters or exact values on some of these tags, which is both efficient and flexible.

With that design, the name of a metric identifies a category grouping all metrics of the same nature. Tags provide additional contextual information, typically characterizing the point of measure.
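
As a purely illustrative example (the exact field names depend on the metric and on your platform version; the values below are made up), a metric document conceptually looks like the following Python dictionary: a short metric name, its consolidated value(s), and a set of context tags:

# Illustrative metric document: field names and values are examples,
# not an authoritative PunchPlatform schema.
metric_document = {
    "@timestamp": "2023-01-01T12:00:00.000Z",
    "name": "elasticsearch.spout.fetch.rate",   # the name identifies a metric category
    "value": {"count": 1250, "m1_rate": 20.8},   # consolidated statistical subseries
    "tags": {                                    # context tags characterizing the point of measure
        "tenant": "mytenant",
        "channel": "apache",
        "topology": "parsing",
        "host": "node01",
    },
}

Dashboards then filter on tag values (for example the tenant or channel tags) rather than on metric name patterns.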

Elasticsearch has been selected as the main metrics backend because:

  • It is resilient and scalable.
  • It allows evolutive, additive tagging of metric values, which eases dashboard development without impacting existing dashboards.
  • It reduces the number of COTS components needed for a PunchPlatform deployment (Elasticsearch is already included).

Naming convention for metrics context tags

All PunchPlatform metrics context tags are stored as sub-fields of the "tags" field of the metrics documents. When tag sub-field names are given in the PunchPlatform documentation, the . character denotes nested sub-fields.

The Elasticsearch mapping template for PunchPlatform metrics applies to the *-metrics-* index pattern and uses @timestamp as the timestamp field.
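
As a hedged sketch (the tag sub-field names used in the filters are illustrative and must be adapted to your actual mapping; only the *-metrics-* index pattern and the @timestamp field come from the template above), tag-based retrieval can be expressed as a standard Elasticsearch query, here sent with the Python requests library:

import requests

# Retrieve the latest metrics of one tenant/channel by filtering on tag sub-fields.
# 'tags.tenant' and 'tags.channel' are illustrative sub-field names.
query = {
    "size": 10,
    "sort": [{"@timestamp": "desc"}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"tags.tenant": "mytenant"}},
                {"term": {"tags.channel": "apache"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
}

response = requests.post(
    "http://localhost:9200/*-metrics-*/_search",  # assumed local Elasticsearch endpoint
    json=query,
    timeout=10,
)
for hit in response.json()["hits"]["hits"]:
    print(hit["_source"])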

Metrics Handling

The PunchPlatform Storm components (input, processing and output nodes), as well as the system-level metrics collectors, publish various metrics in real time. These metrics are of several types:

  • Histograms: measure the statistical distribution of values in a stream of data. In addition to minimum, maximum, mean, etc., they also measure the median and the 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles.
  • Meters: measure the rate of events over time (e.g., "requests per second"). In addition to the mean rate, meters also track 1-, 5-, and 15-minute moving averages.
  • Counters: A counter is just a gauge for an AtomicLong instance.

These metrics are published only periodically to a metrics storage backend, either loggers or an Elasticsearch cluster. You can activate them with no impact on platform performance: they are designed to cope with high-traffic processing. All statistical calculations are performed in real time in the Storm workers, and only the consolidated result is sent periodically to the backend.

Metrics are not stored/indexed in Elasticsearch the way logs are indexed.
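
The following Python sketch illustrates the principle only (it is not the PunchPlatform implementation): measurements are accumulated in memory on the hot path, and a single consolidated snapshot is flushed to the backend at every reporting period:

import threading
import time

class PeriodicMetricsReporter:
    """Conceptual sketch: accumulate counters in memory and periodically flush
    one consolidated snapshot to a backend (a logger, Elasticsearch, ...)."""

    def __init__(self, publish, period_seconds=10):
        self._publish = publish        # callback receiving the consolidated snapshot
        self._period = period_seconds  # roughly 10s by default, as described above
        self._counters = {}
        self._lock = threading.Lock()

    def inc(self, name, value=1):
        # Called on the hot path: a pure in-memory increment, no I/O.
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + value

    def flush_once(self):
        with self._lock:
            snapshot = dict(self._counters)
        self._publish(snapshot)        # non-blocking in the real system; losses are tolerated

    def run_forever(self):
        while True:
            time.sleep(self._period)
            self.flush_once()

# Usage sketch with a made-up counter name.
reporter = PeriodicMetricsReporter(publish=print, period_seconds=10)
reporter.inc("messages.processed", 3)
reporter.flush_once()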

Features

  • Periodic (~10s, customizable) grouped sending of metrics values to the metrics backend(s).
  • "metric name/metric value" type backends :
  • "tagged values" type backends (i.e. additional information can accompany a metric value, instead of being inside the metric name):
  • Sending is non-blocking ; in case on unreachable back end, metrics values will be lost and functional behaviour of the software will continue ; when back end is available again, sending of new metrics values will succeed.
  • Functional parts of the software can provide instantaneous measures, or counter metrics increase.
  • Metrics layer takes care of associated computed statistics (counter aggregation, long term average, 1 minute mean, ) and of providing a single value by sending period ("bucketing")
  • "uptime" metrics are generic automatically-increased time counter metrics that indicate the process is still alive
  • Metrics can be sent either directly by a process to the backend(s) AND can also be forwarded through the PunchPlatform channel, and sent by another process of the channel (acting as a "metrics proxy").
  • Direct sending to the various backends, forwarding are settings that can be overridden at a topology setting level
  • Metrics proxy activation can be activated at a spout setting level

Common variations of PunchPlatform metrics

For all metrics indicating event rates and latency measures, the metric is automatically broken down into multiple data subseries:

  • m1_rate: the Exponentially Weighted Moving Average (EWMA) of the rate over a 1-minute period (one minute is the mean lifetime of each reading in the average, so readings from the last minute account for roughly 60% of the computed value). A minimal computation sketch is given after this list.
  • mean_rate: a long-term moving average with exponential temporal attenuation.
  • count: the cumulative value registered so far (this value is often meaningless, except for message counts, where it actually represents the number of messages accumulated over time).
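
The sketch below shows how such an exponentially weighted 1-minute rate can be computed in Python. It is a conceptual example: the 5-second tick and the smoothing formula follow the usual EWMA scheme, not necessarily the exact internals of the PunchPlatform metrics library.

import math

TICK_SECONDS = 5  # the rate is re-evaluated every 5-second tick (assumed value)
ALPHA_1MIN = 1 - math.exp(-TICK_SECONDS / 60.0)  # decay factor for a 1-minute mean lifetime

def update_m1_rate(previous_rate, events_since_last_tick):
    """Fold the instantaneous rate of the last tick into the 1-minute EWMA."""
    instant_rate = events_since_last_tick / TICK_SECONDS  # events per second
    return previous_rate + ALPHA_1MIN * (instant_rate - previous_rate)

# Usage sketch: 100 events during the last 5-second tick, previous EWMA of 15 events/s.
print(update_m1_rate(15.0, 100))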

For latency and size measures, additional data series are provided, derived from the base measure:

  • stddev: the standard deviation
  • max: the maximal value
  • min: the minimal value

When using the Elasticsearch backend, these values are grouped in a single metric record, as subfields of the metric object. The series value field is therefore the name of the metric, then a . (dot), then the subseries identifier. For example: elasticsearch.spout.fetch.rate.count and elasticsearch.spout.fetch.rate.m1_rate.

To know the available subseries, refer to the following table and to the metrics types listed in the Spouts and Bolts documents.

TYPE        SUBSERIES
Counter     count
Gauge       gauge
Histogram   count, max, min, mean, stddev
Meter       count, m1_rate
Timer       count, m1_rate, max, min, mean, stddev
TimeValue   "max", "ts_diff", ... (depends on the metric)

Metrics Normalization

This section describes the content of the common metrics fields. Some are mandatory and will be found in every metric, others are only set when useful.

This field normalization is essential: it is the key to creating powerful Kibana dashboards based on metrics coming from various sources. Thanks to it, you can apply the same filters to all the visualizations.

Document fields shared across all metrics

  • platform.id (string)

    The unique ID of the platform on which the metric has been generated. Useful when working with multiple LMRs/LTRs on the same project.

  • type (string)

    Designates the metric source type. For example, its value will be "storm" for metrics coming from Storm topologies, or "platform" for those coming from the platform monitoring.

Common fields (when relevant)

  • host.name (string)

    Contrary to the event field 'rep.host.name', for metrics we decided to follow the ECS convention. This field contains the hostname from which the measurement was taken.

  • platform.tenant (string)

    The associated tenant

  • platform.channel (string)

    The associated channel
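
To make the benefit concrete, here are two illustrative documents (values are made up) coming from different sources but sharing the same normalized fields, so a single Kibana filter such as platform.tenant: mytenant matches both:

# Illustrative documents only: values are examples, not real platform output.
storm_metric = {
    "@timestamp": "2023-01-01T12:00:00.000Z",
    "type": "storm",                   # metric produced by a Storm topology
    "platform": {"id": "punch-primary", "tenant": "mytenant", "channel": "apache"},
    "host": {"name": "node01"},        # ECS-style hostname
}

platform_metric = {
    "@timestamp": "2023-01-01T12:00:00.000Z",
    "type": "platform",                # metric produced by the platform monitoring
    "platform": {"id": "punch-primary", "tenant": "mytenant"},
    "host": {"name": "node02"},
}

# The same filter applies to both documents in a Kibana dashboard.
for doc in (storm_metric, platform_metric):
    assert doc["platform"]["tenant"] == "mytenant"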

Metricbeat

  • module: system
  • module: kafka
  • module: zookeeper

See the official Metricbeat documentation for further information.