ADM Training - Track2: daily operation/most frequent operator actions¶

Most frequent operator actions are related to:

deployment or update of the running channels/applications configuration
data lifecycle management as part of a production incident handling or a specific business request (replaying/closing/opening/deleting/extracting/reloading)
planned maintenance of the Punchplatform or an external subsystem that interacts with it (stopping channels/applications/services/servers, restarting things)

Deployment or update of channels/application configuration¶

This mostly implies:

importing or changing the channels configuration/resources in the operator command-line environment
using the reload or stop + start subcommands of the channelctl command

The reload command (available since the 6.3 release) is a shortcut that can be configured at channel structure level, in order to differentiate applications that are safe to stop/restart from those that cannot be restarted without risk (the input punchlines on collectors, for example). So normally, the reload command should always be used when updating some configuration. Needing to do 'stops' often implies some specific procedure to ensure the best service availability.

Kafka Data/Topics Operation¶

Kafka data replay¶

One of the most frequent Kafka-related operations is to 'replay' some topic data after changing some punchline configuration.

This is not actually a change of the kafka cluster, but an update of committed consumer offsets stored inside kafka.

Have a look at punchplatform-kafka-consumers.sh and especially '--reset-XXX' subcommands.

Quiz

Why is Kafka not truly a queuing system?

Key Points

Because Kafka does not check whether data has been read/processed by consumers before destroying it.

Consumers track the point up to which they have correctly read and processed each partition. These offsets are stored per consumer group in the __consumer_offsets topic.
Because multiple readers can consume the same data without interfering.

Consumers interfere (in fact cooperate, by distributing the responsibility for the partitions among themselves) only if they use the same consumer group id.

Procedure steps for replaying some kafka data

check the topic name and consumer group name from the punchline configuration, and determine the current reading point/backlog by using the --describe subcommand of punchplatform-kafka-topics.sh.
stop the punchline that consumes the data, and that needs to replay data from the topic, using channelctl command.
use punchplatform-kafka-consumers.sh to change the committed offsets (resetting them either to the start of the topic with --reset-to-earliest, or to some intermediate point, by shifting the offsets backward or by looking up a given date through the native kafka-consumer-groups.sh command).
start again the punchline that consumes the data.
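As an illustration of the offset-reset step, here is a minimal sketch that builds the command line of the native kafka-consumer-groups.sh tool for the three reset modes mentioned above (earliest, backward shift, date lookup). The broker address, group and topic names are hypothetical placeholders, and the helper function is not part of the Punchplatform tooling. The native tool refuses to reset the offsets of an active group, which is why the consuming punchline has to be stopped first.

```python
def build_reset_command(bootstrap, group, topic,
                        to_datetime=None, shift_by=None, execute=False):
    """Build the argv of a native kafka-consumer-groups.sh offset reset.

    All argument values are placeholders chosen by the caller.
    """
    argv = ["kafka-consumer-groups.sh",
            "--bootstrap-server", bootstrap,
            "--group", group,
            "--topic", topic,
            "--reset-offsets"]
    if to_datetime is not None:
        # Expected format: YYYY-MM-DDTHH:mm:SS.sss
        argv += ["--to-datetime", to_datetime]
    elif shift_by is not None:
        # A negative value moves the committed offsets backward (replay).
        argv += ["--shift-by", str(shift_by)]
    else:
        argv.append("--to-earliest")
    # Without --execute, the tool only prints the offsets it would set.
    argv.append("--execute" if execute else "--dry-run")
    return argv


if __name__ == "__main__":
    # Hypothetical names: preview a replay of the last 1000 messages per partition.
    print(" ".join(build_reset_command("kafka1:9092", "mygroup", "mytopic",
                                       shift_by=-1000)))
```

Running the dry-run form first makes it possible to review the target offsets before committing them with the execute form.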

Kafka data purge or retention settings change¶

If for some reason you need to destroy a Kafka topic or alter its retention/replication settings, you can use punchplatform-kafka-topics.sh or the more advanced HOW TO.

Kafka consumers processing backlog checking¶

If you want to see easily if a punchline is lagging (i.e. has a big processing backlog of unprocessed messages), you can check this lag through the punchplatform-kafka-consumers.sh --describe subcommand.
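To make the notion of lag concrete, the following sketch computes it the way it is usually defined: for each partition, the last written offset (log end offset) minus the offset committed by the consumer group. The offset values and function names are made up for illustration; the real figures come from the --describe output.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return the lag per partition, or None when nothing was committed yet."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            lag[partition] = None  # the group never committed on this partition
        else:
            lag[partition] = max(end_offset - committed, 0)
    return lag


def total_lag(lag):
    """Sum the known per-partition lags."""
    return sum(value for value in lag.values() if value is not None)


if __name__ == "__main__":
    # Hypothetical offsets for a 3-partition topic.
    lags = partition_lag({0: 1500, 1: 900, 2: 10}, {0: 1200, 1: 900})
    print(lags, total_lag(lags))
```

A lag that keeps growing over successive checks indicates that the punchline consumes more slowly than producers write.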

Stopping/Restarting kafka¶

Kafka mostly handles recent data in memory. So if you suddenly stop multiple Kafka broker nodes while some punchline is writing new messages to them, you will lose data.

A clean Kafka stop requires either:

stopping only one broker node in a healthy cluster (check its replication health through punchplatform-kafka-topics.sh --describe)

or

stopping all writes to the cluster, by stopping the appropriate punchlines, before stopping the whole Kafka cluster.

Stopping a node is done through the sudo systemctl stop kafka-<clusterId> command on the broker node.

(Re-)starting stopped nodes is done either by (re)booting the containing VM/server, or by running sudo systemctl start kafka-<clusterId>, in any order.
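The 'healthy cluster' condition for a single-broker stop can be sketched as a check on the replica and in-sync replica (ISR) lists reported for each partition. The data structure below is a hypothetical transcription of a topic description, not the actual output format of punchplatform-kafka-topics.sh.

```python
def partitions_at_risk(partitions, broker_id):
    """List the (topic, partition) pairs that make stopping broker_id unsafe.

    A partition is reported when it is already under-replicated, or when
    stopping the broker would leave it without any in-sync replica.
    """
    at_risk = []
    for p in partitions:
        replicas, isr = set(p["replicas"]), set(p["isr"])
        under_replicated = isr != replicas
        no_copy_left = not (isr - {broker_id})
        if under_replicated or no_copy_left:
            at_risk.append((p["topic"], p["partition"]))
    return at_risk


if __name__ == "__main__":
    # Hypothetical cluster state transcribed by hand.
    state = [
        {"topic": "logs", "partition": 0, "replicas": [1, 2], "isr": [1, 2]},
        {"topic": "logs", "partition": 1, "replicas": [2, 3], "isr": [3]},
        {"topic": "single", "partition": 0, "replicas": [1], "isr": [1]},
    ]
    print(partitions_at_risk(state, broker_id=1))
```

An empty result means that, under these assumptions, every partition keeps at least one in-sync copy while the broker is down.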

Quiz

Check the lag of the dispatcher application of monitoring channel in the platform tenant of your training platform.

Elasticsearch Data/Indices operation¶

Manual data housekeeping¶

You can close/open/delete indices manually, using indices management REST API.

Remember that the automatic Elasticsearch housekeeping may interact with what you are doing. See details in the indices management procedure.

Elasticsearch data lifecycle automation¶

For automating the purge of old data (or change in location/replication settings for such data), a specific standard punchplatform service has to be configured and run in each individual tenant.

Have a look at the available actions/settings you can configure in this Elasticsearch housekeeping service.

Key points

You can act on more than just deletion/closure: you can move indices from fast to slow storage, or change their replication over time.
A tenant 'Elasticsearch housekeeping' service can only act upon indices whose names begin with <tenant name>-.

This allows different retention settings between tenants.

This prevents someone from mistakenly removing another tenant's indices by copy-pasting housekeeping settings.

Elasticsearch indexing (mappings) configuration change¶

In case new or updated channels and resources (e.g. log parsers) are deployed, or you want to change default replication/sharding levels for new indices, then Elasticsearch 'indices templates' have to be updated inside the appropriate cluster.

This is the same as during post-deployment configuration.

So it implies:

importing new/updated 'template' files
using the punchplatform-push-es-templates.sh tool to upload the new templates in the cluster
(sometimes) using curl -XDELETE myindex:9200/_template/sometemplate to purge an obsolete index template from the cluster
then waiting for a new index to be created, so that the templates apply to it (existing indices are not impacted by an update of the index templates)
(sometimes) changing and reloading the configuration of some punchlines so that they write to an index with a new name, so that an indexing configuration change is applied immediately.

Stopping/(Re-)starting Elasticsearch cluster¶

A single Elasticsearch node can be stopped by running the sudo systemctl stop elasticsearch command on the node.

This may reduce data availability or cluster resilience. Elasticsearch may try to compensate by re-creating data replicas on another node.

For this reason, before any node stop with a planned duration in excess of a few minutes, or in case of a planned full cluster stop, it is better to manually alter the runtime configuration of the cluster to prevent re-assignment of replicas.

This requires specific Elasticsearch API calls, both before and after the maintenance phase (to reset the settings to normal operation mode once the cluster is up again).

The overall procedure is well described in the official Elasticsearch cluster restart procedure.
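As a sketch of the two API calls involved, the official restart procedure changes the cluster.routing.allocation.enable setting through PUT _cluster/settings: it is set to primaries before the stop, then reset to null (back to the default) once the nodes have rejoined. The helper below only builds the request bodies; the function name is illustrative, and older Elasticsearch versions document slightly different values, so check the procedure matching your version.

```python
import json

ALLOCATION_KEY = "cluster.routing.allocation.enable"


def allocation_settings_body(enable_all):
    """Build the JSON body for PUT _cluster/settings.

    enable_all=False restricts allocation to primaries (before the stop);
    enable_all=True resets the setting to its default (after the restart).
    """
    value = None if enable_all else "primaries"
    return json.dumps({"persistent": {ALLOCATION_KEY: value}})


if __name__ == "__main__":
    print("before maintenance:", allocation_settings_body(False))
    print("after maintenance: ", allocation_settings_body(True))
```

Either body can be sent with curl -XPUT to the _cluster/settings endpoint of any node of the cluster, with a Content-Type: application/json header.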

Important

When (re)starting an Elasticsearch cluster, the cluster will be busy for several minutes before the data is available, and may need minutes or hours before the data is normally replicated again.

This can be greatly sped up if the punchlines that normally write/index data into Elasticsearch and the applications that query Elasticsearch (PML punchlines, correlation/alerting engine) are stopped before the cluster is stopped, and restarted only after the Elasticsearch cluster is at least in 'yellow' status again (i.e. data is available for reading/writing).
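Waiting for the 'yellow' status can be scripted with the cluster health API, which accepts wait_for_status and timeout parameters and reports status and timed_out fields in its response. The sketch below builds the URL and interprets a response; the node address is a hypothetical placeholder and the HTTP call itself is left to the tool of your choice.

```python
from urllib.parse import urlencode

STATUS_RANK = {"red": 0, "yellow": 1, "green": 2}


def health_url(base_url, wanted="yellow", timeout="60s"):
    """Build a cluster health URL that blocks until `wanted` or the timeout."""
    query = urlencode({"wait_for_status": wanted, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def is_ready(health, wanted="yellow"):
    """True when the reported status is at least `wanted` and the wait did not time out."""
    reached = STATUS_RANK[health["status"]] >= STATUS_RANK[wanted]
    return reached and not health.get("timed_out", False)


if __name__ == "__main__":
    print(health_url("http://es1:9200"))  # hypothetical node address
    print(is_ready({"status": "yellow", "timed_out": False}))
```

Punchlines and querying applications would be restarted only once is_ready returns true.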

Archiving storage operation¶

Checking/Sizing/viewing some archiving content¶

When documents are archived through usage of File Node + Elasticsearch storage of archive indexing data (see Reference archiving punchline example), then the archive content can be sized/viewed in the following ways:

From a statistics/metrics point of view, by using a Kibana archiving dashboard, that will provide information of the stored content based on the indexed Archiving Metadata.

Extracting an archive files selection into kafka¶

For extracting archive documents into filesystem output files or to (re)index them into Elasticsearch, the process is the same:

Select and extract archive files into kafka
Use some punchline to write the documents into filesystem output or into Elasticsearch

The extraction process is described here. But because the content of the archive (format, backend, stored fields...) is platform-specific, no 'off the shelf' extraction punchline is delivered; it has to be designed and validated by the integrator when the archiving configuration is designed and validated, not later!

Key points

The extraction punchline is made of two parts

An Extraction input node, in charge of running a selection query on the metadata stored in the archive metadata indexing Elasticsearch. This query can be designed by hand using Kibana, to determine how much data is selected (by applying filters on time and topic)
An archive reader node, in charge of opening each file identified by its metadata (read from Elasticsearch) and of emitting part or all of the read documents/fields.

Note that it is possible to filter the documents that are output by the archive reader, if some pattern can be defined to select interesting lines (such as an IP address or domain name pattern).

Configuring an automated purge of obsolete archived data¶

Same as for Elasticsearch housekeeping, a standard Punchplatform Archive housekeeping service can be configured in each individual tenant to ensure removal of archived data past a given retention duration.

Protection against mass destruction

Do not forget to set max_deletion_percentage to a reasonable value (say 2%) to allow automatic (daily) purging of old data, without risking automatic mass erasure in case of incident related to operating system time shifting to the future.
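The intent of such a safeguard can be sketched as follows. This is only an illustration of the idea, under the assumption that the percentage compares the amount selected for deletion with the total archived amount; the actual housekeeping service may compute it differently, and the function name is hypothetical.

```python
def purge_allowed(total_items, items_to_delete, max_deletion_percentage):
    """Refuse a purge run that would delete more than the allowed share.

    Assumed semantics (not taken from the service documentation): the share
    is the number of items selected for deletion over the total archived.
    """
    if total_items == 0:
        return items_to_delete == 0
    percentage = 100.0 * items_to_delete / total_items
    return percentage <= max_deletion_percentage


if __name__ == "__main__":
    # Normal daily run: one day out of a year of retention (about 0.3%).
    print(purge_allowed(365, 1, max_deletion_percentage=2))
    # System clock jumped far into the future: most of the archive looks obsolete.
    print(purge_allowed(365, 300, max_deletion_percentage=2))
```

With a daily purge, the legitimate share stays well below a few percent, so a low threshold blocks only abnormal runs.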

Stopping/(Re)-starting Zookeeper¶

Zookeeper nodes can be individually stopped using sudo systemctl stop zookeeper-<clusterId> command on the node.

If half or more of the nodes are stopped, the cluster goes offline, which will stop Kafka and Storm (and therefore Shiva and most punchlines).
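This follows from the Zookeeper quorum rule: the cluster stays online only while a strict majority of its nodes is running. A minimal sketch of that rule (the function name is illustrative):

```python
def has_quorum(ensemble_size, stopped_nodes):
    """True while a strict majority of the ensemble is still running."""
    running = ensemble_size - stopped_nodes
    return running > ensemble_size // 2


if __name__ == "__main__":
    for size, stopped in [(3, 1), (3, 2), (4, 2), (5, 2), (5, 3)]:
        print(size, stopped, has_quorum(size, stopped))
```

This is also why a 4-node cluster tolerates no more stopped nodes than a 3-node one: in both cases, only a single node can be down.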

If the leader node is stopped, the cluster re-elects another leader. This may cause a few seconds to a few minutes of interruption to Kafka and the dependent punchlines (while all clients re-connect to the Zookeeper cluster and re-elect leader nodes).

A clean stop of the whole cluster only implies stopping all the zookeeper nodes.

They restart automatically when the host VM/server (re)boots, or can be started manually using systemctl.

Stopping/(Re-)starting Shiva¶

Shiva management data is consistently persisted in Kafka. Shiva nodes can be stopped in any order by using sudo systemctl stop shiva-runner.

Danger

When a Shiva runner is stopped using systemctl, all the child processes (punchlines) running on the node under supervision of this Shiva node will be killed.

The cluster will take care of restarting the killed tasks elsewhere in the shiva cluster if allowed by the placement constraints (shiva 'tags' on the task).

Stopping/(Re-)starting Storm¶

Storm cluster persists its critical management data consistently in Zookeeper, except the punchline/topologies binaries.

This means that nimbus nodes should only be stopped when no channel/application start has been recently requested through channelctl (i.e. in the last minutes).

Apart from this constraint, all nodes can be individually or collectively stopped using sudo systemctl stop storm-<module>, where module is one of nimbus, ui or supervisor.

Stopping/(Re-)starting a whole Punchplatform¶

Stopping/starting is best done in dependency order, to reduce the risk of inconsistent critical data on the persistence filesystem.

The best order is therefore:

Stop channels (if time allows)
Stop metricbeat daemons (sudo systemctl stop metricbeat on all nodes through Ansible usage from deployer node)
Configure and stop Elasticsearch
Stop Shiva
Stop Storm
Stop Kafka
Stop Zookeeper

When restarting the platform, everything can be started at the same time. However, if time allows, the servers should be restarted in this order:

Start Elasticsearch servers. Wait for at least 'yellow' status (green is better but takes longer).
(if a clean stop was done) Reconfigure Elasticsearch to allow moving/creating of replicas
Start Zookeeper and Kafka servers
Start Storm and Shiva servers
(if they were stopped) Start the channels