ADM Training - Track2: daily operation/most frequent operator actions¶
Most frequent operator actions are related to:
- deployment or update of the running channels/applications configuration
- data lifecycle management as part of a production incident handling or a specific business request (replaying/closing/opening/deleting/extracting/reloading)
- planned maintenance of the punch platform or an external subsystem that interacts with it (stopping channels/applications/services/servers, restarting things)
Deployment or update of channels/application configuration¶
This mostly implies:
- importing or changing the channels configuration/resources in the operator command-line environment
- start commands of the channelctl command
- the reload command (available since the 6.3 release), a shortcut that can be configured at channel structure level in order to differentiate applications that are not dangerous to stop/restart from those that cannot usually be restarted without risk (the input punchlines on collectors, for example).
The reload command should always be used when updating some configuration. Needing to do 'stops' often implies some specific procedure to ensure the best service availability.
Kafka Data/Topics Operation¶
Kafka data replay¶
One of the most frequent kafka-related operations is to obtain a 'replay' of some topic data after changing some punchline configuration.
This is not actually a change of the kafka cluster, but an update of committed consumer offsets stored inside kafka.
Have a look at punchplatform-kafka-consumers.sh and especially '--reset-XXX' subcommands.
Why is Kafka not truly a queuing system?
- because Kafka does not check whether data has been read/processed by consumers before destroying it. Consumers track up to what point they have correctly read and processed each partition. This is stored in 'consumer groups' inside Kafka.
- because multiple readers can consume the same data without interfering. Consumers interfere (in fact, they cooperate by distributing the responsibility for the partitions among themselves) if they use the same consumer group id.
Procedure steps for replaying some kafka data:
- get the consumer group name from the punchline configuration, and determine the current reading point/backlog by using the punchplatform-kafka-consumers.sh --describe subcommand
- stop the punchline that consumes the data and needs to replay data from the topic
- use punchplatform-kafka-consumers.sh to change the committed offsets (resetting them either to the start of the topic with --reset-to-earliest, or to some intermediate point by shifting the offset backward or by looking up a given date through the native kafka-consumer-groups.sh command)
- start again the punchline that consumes the data
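The date-based reset mentioned in the steps above can be previewed with the native Kafka tool. The sketch below only builds the command line for kafka-consumer-groups.sh; the broker address, group name and topic name are placeholders, and the options of punchplatform-kafka-consumers.sh itself are deliberately not shown (check its own help output).

```shell
#!/usr/bin/env bash
# Sketch: build a native kafka-consumer-groups.sh command that rewinds a
# consumer group to a given date. All argument values are placeholders.

# $1 bootstrap server, $2 consumer group, $3 topic,
# $4 datetime (format YYYY-MM-DDTHH:mm:SS.sss),
# $5 "--dry-run" (default, only prints the planned offsets) or "--execute".
reset_to_datetime_cmd() {
  local mode="${5:---dry-run}"
  printf '%s --bootstrap-server %s --group %s --topic %s --reset-offsets --to-datetime %s %s\n' \
    "kafka-consumer-groups.sh" "$1" "$2" "$3" "$4" "$mode"
}

# Example (the consuming punchline must be stopped first, because Kafka
# refuses to reset the offsets of an active group):
#   eval "$(reset_to_datetime_cmd localhost:9092 my_group my_topic 2024-01-31T00:00:00.000)"
```

Keeping --dry-run as the default lets the operator review the target offsets before re-running the same command with --execute.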
Kafka data purge or retention settings change¶
Kafka consumers processing backlog checking¶
If you want to see easily if a punchline is
lagging (i.e. has a big processing
backlog of unprocessed messages),
you can check this lag through the
punchplatform-kafka-consumers.sh --describe subcommand.
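To turn such a description into a single figure, the per-partition lag values can be summed. This is a minimal sketch which assumes that the output has a header row with a column named LAG, as the native kafka-consumer-groups.sh --describe output does; whether the punchplatform wrapper prints the same layout should be checked on your platform.

```shell
#!/usr/bin/env bash
# Sketch: sum the LAG column of a "describe" output read on stdin.
# Assumption: whitespace-separated columns with a header row naming LAG.
total_lag() {
  awk '
    # Until the header is found, look for the column named LAG.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Data rows: add numeric lags, skip "-" (no committed offset yet).
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}

# Example: <describe command for your consumer group> | total_lag
```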
Kafka mostly handles recent data in memory. So if you suddenly stop multiple kafka broker nodes while some punchline is writing new messages to them, you will lose data.
A clean kafka stop implies either:
- stop only one broker node in a healthy cluster (check its replication health through
- stop writing to the cluster, by stopping the appropriate punchlines, before stopping the whole kafka cluster.
Stopping nodes is done through the sudo systemctl stop kafka-<clusterId> command on the broker node.
(Re-)starting stopped nodes is done either by (re)booting the containing VM/server, or through sudo systemctl start kafka-<clusterId>, in any order.
Check the lag of the dispatcher application of the monitoring channel in the platform tenant of your training platform.
Elasticsearch Data/Indices operation¶
Manual data housekeeping¶
You can close/open/delete indices manually, using the indices management REST API.
Automatic Elasticsearch housekeeping may interfere with what you are doing; see details in the indices management procedure.
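As an illustration of these manual operations, the sketch below prints the curl command matching each action, using the standard Elasticsearch index endpoints (POST <index>/_close, POST <index>/_open, DELETE <index>). The base URL and the index name are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print the curl command for a manual index operation.
# $1 Elasticsearch base URL, $2 index name, $3 action (close|open|delete).
es_index_cmd() {
  case "$3" in
    close|open) printf 'curl -XPOST %s/%s/_%s\n' "$1" "$2" "$3" ;;
    delete)     printf 'curl -XDELETE %s/%s\n' "$1" "$2" ;;
    *)          echo "unknown action: $3" >&2; return 1 ;;
  esac
}

# Example (hypothetical index name):
#   es_index_cmd http://localhost:9200 mytenant-events-2024.01.31 close
```

Printing the command instead of running it leaves room for a review before a deletion, which cannot be undone.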
Elasticsearch data lifecycle automation¶
For automating the purge of old data (or change in location/replication settings for such data), a specific standard punchplatform service has to be configured and run in each individual tenant.
Have a look at the available actions/settings you can configure in this Elasticsearch housekeeping service.
You can act on other things than just deletion/closure: you can move indices from fast to slow storage, or change replication over time.
A tenant 'Elasticsearch housekeeping' service can only act upon indices which names begin with
This allows different retention settings between tenants.
This prevents someone from mistakenly removing another tenant's indices by a copy-paste of housekeeping settings.
Elasticsearch indexing (mappings) configuration change¶
In case new or updated channels and resources (e.g. log parsers) are deployed, or you want to change default replication/sharding levels for new indices, then Elasticsearch 'indices templates' have to be updated inside the appropriate cluster.
This is the same as during post-deployment configuration.
So it implies :
- importing new/updated 'template' files
- using the punchplatform-push-es-templates.sh tool to upload the new templates in the cluster
- (sometimes) using curl -XDELETE myindex:9200/_template/sometemplate to purge an obsolete index template from the cluster
- then waiting for a new index to be created, so that the templates apply to it (existing indices are not impacted by an update of the index templates)
- (sometimes) changing and reloading the configuration of some punchlines so that they write to an index with a new name, so that the indexing configuration change is applied now.
Stopping/(Re)starting Elasticsearch cluster¶
A single Elasticsearch node can be stopped by the sudo systemctl stop elasticsearch command on the node.
This may cause a reduction of the data availability or cluster resilience. Elasticsearch may try to compensate by re-creating data replicas on another node.
For this reason, before any node stop with a planned duration in excess of a few minutes, or in case of a planned full cluster stop, it is better to manually alter the runtime configuration of the cluster to prevent re-assignment of replicas.
This requires usage of specific Elasticsearch API calls, both before and after the maintenance phase (to reset the settings to normal operation mode after the cluster is up again).
The overall procedure is well described in the Elasticsearch official cluster restart procedure.
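As a reminder of what these API calls look like, the sketch below prints the JSON body sent to the cluster settings API (PUT _cluster/settings) before and after the maintenance, following the setting used in the official restart procedure (cluster.routing.allocation.enable). The host in the example is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: JSON body for the cluster settings API (PUT _cluster/settings).
# $1 = "primaries" before the maintenance (replicas are no longer reassigned),
#      "default" afterwards (removes the override, back to normal operation).
allocation_settings_body() {
  case "$1" in
    primaries) printf '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}\n' ;;
    default)   printf '{"persistent":{"cluster.routing.allocation.enable":null}}\n' ;;
    *)         echo "expected primaries or default" >&2; return 1 ;;
  esac
}

# Example:
#   curl -XPUT -H 'Content-Type: application/json' \
#     http://localhost:9200/_cluster/settings \
#     -d "$(allocation_settings_body primaries)"
```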
When (re)starting an Elasticsearch cluster, the cluster will be heavily loaded for several minutes before the data is available, and may need minutes or hours before the data is normally replicated again.
This can be highly sped up if the punchlines that normally write/index data into elasticsearch or applications that query elasticsearch (PML punchlines, correlation/alerting engine) are stopped before the cluster is stopped, and restarted only after the Elasticsearch cluster is at least in 'yellow' status again (i.e. data availability for reading/writing).
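Waiting for the 'yellow' status can be scripted with the cluster health API, which accepts wait_for_status and timeout parameters. This is a minimal sketch; the base URL, the default timeout and the polling interval are arbitrary placeholders.

```shell
#!/usr/bin/env bash
# Sketch: URL of a cluster health request that blocks until the wanted
# status is reached or the timeout expires.
# $1 Elasticsearch base URL (placeholder), $2 status (default yellow),
# $3 timeout (default 60s).
health_wait_url() {
  printf '%s/_cluster/health?wait_for_status=%s&timeout=%s\n' \
    "$1" "${2:-yellow}" "${3:-60s}"
}

# Example: poll until the cluster reports at least yellow, then restart
# the punchlines.
#   until curl -s "$(health_wait_url http://localhost:9200)" \
#         | grep -q '"timed_out":false'; do sleep 5; done
```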
Archiving storage operation¶
Checking/Sizing/viewing some archiving content¶
When documents are archived through usage of File Node + Elasticsearch storage of archive indexing data (see Reference archiving punchline example), then the archive content can be sized/viewed in the following ways:
- From a statistics/metrics point of view, by using a Kibana archiving dashboard, which provides information on the stored content based on the indexed Archiving Metadata.
Extracting an archive files selection into kafka¶
For extracting archive documents into filesystem output files or to (re)index them into Elasticsearch, the process is the same:
- Select and extract archive files into kafka
- Use some punchline to write the documents into filesystem output or into Elasticsearch
The extraction process is described here. But because the content of the archive (format, backend, stored fields...) is platform-specific, no 'off the shelf' extraction punchline is delivered; it has to be designed and validated by the integration team when the archiving configuration is designed and validated, not later!
The extraction punchline is made of two parts:
- An extraction input node, in charge of running a selection query on the archive metadata indexed in Elasticsearch. This query can be designed by hand using Kibana, to determine how much data is wanted (by applying filters on time and topic).
- An archive reader node, in charge of opening each file identified by its metadata (read from Elasticsearch) and of emitting part or all of the read documents/fields.
Note that it is possible to filter the documents that are output by the archive reader, if some pattern can be defined to select interesting lines (such as an IP address or domain name pattern).
Configuring an automated purge of obsolete archived data¶
Same as for Elasticsearch housekeeping, a standard Punchplatform Archive housekeeping service can be configured in each individual tenant to ensure removal of archived data past a given retention duration.
Protection against mass destruction
Do not forget to set max_deletion_percentage to a reasonable value (say 2%) to allow automatic (daily) purging of old data, without risking automatic mass erasure in case of an incident where the operating system time shifts to the future.
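The reasoning behind such a threshold can be illustrated with a small check. This is only a sketch of the idea and not the actual Punchplatform implementation; the function name and the unit being counted (archived items) are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper, not the housekeeping service code):
# $1 total archived items, $2 items selected for deletion,
# $3 maximum deletion percentage.
# Succeeds only if the purge stays within the allowed percentage.
purge_allowed() {
  [ "$1" -gt 0 ] && [ $(( $2 * 100 )) -le $(( $3 * $1 )) ]
}

# A normal daily purge: 15 items out of 1000 with a 2% limit is accepted.
#   purge_allowed 1000 15 2 && echo "purge accepted"
# A clock jump making 400 items look obsolete at once is refused.
#   purge_allowed 1000 400 2 || echo "purge refused"
```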
Zookeeper nodes can be individually stopped using the sudo systemctl stop zookeeper-<clusterId> command on the node.
If half of the nodes or more are stopped, the cluster loses its quorum and goes offline, which will stop Kafka and Storm (and therefore Shiva and most punchlines).
If the leader node is stopped, the cluster re-elects another leader. This may cause a few seconds to a few minutes of interruption to kafka and the depending punchlines (while all clients reconnect to the zookeeper cluster and re-elect leader nodes).
A clean stop of the whole cluster only implies stopping all the zookeeper nodes.
They will restart automatically if the host VM/server (re)boots, or can be started manually with systemctl.
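The rule on how many Zookeeper nodes may be stopped follows from the strict majority (quorum) that the cluster needs to stay online. A small sketch of the arithmetic, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Sketch: number of Zookeeper nodes that can be down while a strict
# majority of the cluster remains available.
# $1 total number of nodes in the cluster.
zk_tolerated_failures() {
  local quorum=$(( $1 / 2 + 1 ))
  echo $(( $1 - quorum ))
}

# Example: zk_tolerated_failures 3
```

With 3 nodes the result is 1 and with 5 nodes it is 2; a 4-node cluster still tolerates only 1 stopped node, which is why odd cluster sizes are usually preferred.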
Shiva management data is consistently persisted in Kafka. Shiva nodes can be stopped in any order by using sudo systemctl stop shiva-runner.
When a shiva runner is stopped using systemctl, all the child processes (punchlines) running on the node under supervision of this shiva node will be killed.
The cluster will take care of restarting the killed tasks elsewhere in the shiva cluster if allowed by the placement constraints (shiva 'tags' on the task).
Storm cluster persists its critical management data consistently in Zookeeper, except the punchline/topologies binaries.
This means that nimbus nodes should only be stopped when no channel/application start has been recently requested through channelctl (i.e. in the last minutes).
This constraint excluded, all nodes can be individually or collectively stopped using
sudo systemctl stop storm-<module> where module is one of
Stopping/(Re-)starting a whole Punchplatform¶
Stopping and starting are best done in dependency order, to reduce the risk of inconsistent critical data on the persistence filesystem.
The best order is therefore:
- Stop channels (if time allows)
- Stop metricbeat daemons (sudo systemctl stop metricbeat on all nodes, through Ansible from the deployer node)
- Configure and stop Elasticsearch
- stop Shiva
- stop Storm
- stop Kafka
- stop Zookeeper
When restarting the platform, everything can go up at the same time, although, if time allows, the servers restart order should be:
- start Elasticsearch servers. Wait for at least 'yellow' status (green is better but longer).
- (if a clean stop was done) reconfigure Elasticsearch to allow moving/creating of replicas
- start Zookeeper and Kafka servers
- start Storm and Shiva servers
- (if they were stopped) start the channels