ADM Training - Track2: daily operation/most frequent operator actions¶
Most frequent operator actions are related to:
- deployment or update of the running channels/applications configuration
- data lifecycle management, as part of production incident handling or of a specific business request (replaying/closing/opening/deleting/extracting/reloading)
- planned maintenance of the Punchplatform or of an external subsystem that interacts with it (stopping channels/applications/services/servers, restarting things)
Deployment or update of channels/application configuration¶
This mostly implies:
- importing or changing the channels configuration/resources in the operator command-line environment
- using the `reload` or `stop`+`start` commands of the `channelctl` tool
The `reload` command (available since the 6.3 release) is a shortcut that can be configured at the channel structure level,
in order to differentiate the applications that are safe to stop/restart from those that should normally not be restarted without risk (the input punchlines on collectors, for example).
So the `reload` command should normally always be used when updating some configuration. Needing to do explicit 'stops' often implies following a specific procedure to ensure the best service availability.
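As a minimal hedged sketch (tenant and channel names are examples, and the exact option syntax may vary slightly with your release; check `channelctl --help`):

```sh
# Hedged sketch: 'mytenant' and 'apache' are example names.
cd $PUNCHPLATFORM_CONF_DIR                       # operator environment holding the tenants configuration
channelctl -t mytenant status                    # check what is currently running
channelctl -t mytenant reload --channel apache   # preferred: restart only the applications flagged as safe to reload
# only if a full restart is really needed (mind service availability):
channelctl -t mytenant stop  --channel apache
channelctl -t mytenant start --channel apache
```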
Kafka Data/Topics Operation¶
Kafka data replay¶
One of the most frequent Kafka-related operations is to 'replay' some topic data after changing a punchline configuration.
This is not actually a change to the Kafka cluster itself, but an update of the committed consumer offsets stored inside Kafka.
Have a look at `punchplatform-kafka-consumers.sh`, and especially its '--reset-XXX' subcommands.
Quiz
Why is Kafka not truly a queuing system?
Key Points
- because Kafka does not check whether data has been read/processed by consumers before destroying it.
  Consumers track up to what point they have correctly read and processed each partition. This is stored per 'consumer group' in the `__consumer_offsets` topic.
- because multiple readers can consume the same data without interfering.
  Consumers only interfere (in fact cooperate, by distributing the responsibility for the partitions among themselves) if they use the same `consumer group id`.
Procedure steps for replaying some Kafka data
- check the `topic name` and `consumer group name` in the punchline configuration, and determine the current reading point/backlog using the `--describe` subcommand of `punchplatform-kafka-topics.sh`.
- stop the punchline that consumes the data and needs to replay it from the topic, using the `channelctl` command.
- use `punchplatform-kafka-consumers.sh` to change the committed offsets (resetting them either to the start of the topic with `--reset-to-earliest`, or to some intermediate point by shifting the offsets backward or by looking up a given date through the native `kafka-consumer-groups.sh` command).
- start again the punchline that consumes the data (see the hedged command sketch below).
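Below is a hedged shell sketch of this replay procedure. All names (tenant, channel, topic, consumer group) are placeholders, and the exact selection options of the punch tools should be checked with `--help` on your platform:

```sh
# Hedged sketch: mytenant, mychannel, mytenant-apache-input and mygroup are placeholders.
# 1. check the current reading point / backlog (add the topic/group selection options of your release)
punchplatform-kafka-topics.sh --describe
punchplatform-kafka-consumers.sh --describe

# 2. stop the consuming punchline
channelctl -t mytenant stop --channel mychannel

# 3. reset the committed offsets, e.g. a full replay from the start of the retained data
punchplatform-kafka-consumers.sh --reset-to-earliest

#    ...or replay from a given date using the native Kafka tooling:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group mygroup --topic mytenant-apache-input \
  --reset-offsets --to-datetime 2023-01-01T00:00:00.000 --execute

# 4. restart the consuming punchline
channelctl -t mytenant start --channel mychannel
```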
Kafka data purge or retention settings change¶
If for some reason you need to destroy a Kafka topic or alter its retention/replication settings, you can use `punchplatform-kafka-topics.sh`, or follow the more advanced HOW TO.
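If you prefer the native Kafka tooling for such a change, here is a hedged sketch (the broker address and topic name are placeholders; mind that a topic deletion destroys its data):

```sh
# Hedged sketch using native Kafka tools (an alternative to punchplatform-kafka-topics.sh).
# lower the retention of a topic to 24 hours:
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name mytenant-apache-input \
  --add-config retention.ms=86400000

# delete a topic entirely:
kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic mytenant-apache-input
```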
Kafka consumers processing backlog checking¶
If you want to easily see whether a punchline is lagging (i.e. has a big processing backlog of unprocessed messages), you can check this lag through the `punchplatform-kafka-consumers.sh --describe` subcommand.
Stopping/Restarting kafka¶
Kafka mostly handles recent data in memory. So if you suddenly stop multiple Kafka broker nodes while some punchline is writing new messages to them, you will lose data.
A clean Kafka stop implies either:
- stopping only one broker node at a time in a healthy cluster (check its replication health through `punchplatform-kafka-topics.sh --describe`), or
- stopping writing to the cluster, by stopping the appropriate punchlines, before stopping the whole Kafka cluster.
Stopping a node is done through the `sudo systemctl stop kafka-<clusterId>` command on the broker node.
(Re-)starting stopped nodes is done either by (re)booting the containing VM/server, or by running `sudo systemctl start kafka-<clusterId>`, in any order.
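A hedged sketch of a clean single-broker restart (`<clusterId>`, tenant and channel names are placeholders):

```sh
# Hedged sketch: before stopping a broker, check that all partitions are fully replicated.
punchplatform-kafka-topics.sh --describe

# stop then restart a single broker node:
sudo systemctl stop kafka-<clusterId>
sudo systemctl start kafka-<clusterId>

# for a full cluster stop, first stop the punchlines that write into Kafka:
channelctl -t mytenant stop --channel mychannel
# then stop kafka on every broker node; brokers can later be restarted in any order.
```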
Quiz
Check the lag of the `dispatcher` application of the `monitoring` channel in the `platform` tenant of your training platform.
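One hedged way to check this (the consumer group naming and the filtering options of the tool are platform-specific):

```sh
# Hedged sketch: list the consumer groups and their lag, then look for the group used by
# the 'dispatcher' application of the 'monitoring' channel (tenant 'platform').
punchplatform-kafka-consumers.sh --describe
# the LAG column (end offset minus committed offset) is the unprocessed backlog per partition.
```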
Elasticsearch Data/Indices operation¶
Manual data housekeeping¶
You can close/open/delete indices manually, using the indices management REST API.
Remember that the automatic Elasticsearch housekeeping may interact with what you are doing... see the details in the indices management procedure.
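A hedged sketch of these standard Elasticsearch REST calls (replace `localhost:9200` with your Elasticsearch endpoint, and the index name with a real one):

```sh
# list indices with their status and size:
curl 'localhost:9200/_cat/indices?v'

# close an index (data stays on disk but is no longer searchable nor using heap):
curl -XPOST 'localhost:9200/mytenant-apache-2023.01.01/_close'

# re-open it later:
curl -XPOST 'localhost:9200/mytenant-apache-2023.01.01/_open'

# delete it permanently:
curl -XDELETE 'localhost:9200/mytenant-apache-2023.01.01'
```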
Elasticsearch data lifecycle automation¶
For automating the purge of old data (or change in location/replication settings for such data), a specific standard punchplatform service has to be configured and run in each individual tenant.
Have a look at the available actions/settings you can configure in this Elasticsearch housekeeping service.
Key points
- You can act on other things than just deletion/closure: you can move indices from fast to slow storage, or change replication over time.
- A tenant 'Elasticsearch housekeeping' service can only act upon indices whose names begin with `<tenant name>-`. This allows different retention settings between tenants, and prevents someone from mistakenly removing another tenant's indices through a copy-paste of housekeeping settings.
Elasticsearch indexing (mappings) configuration change¶
In case new or updated channels and resources (e.g. log parsers) are deployed, or you want to change default replication/sharding levels for new indices, then Elasticsearch 'indices templates' have to be updated inside the appropriate cluster.
This is the same as during post-deployment configuration.
So it implies (see the sketch after this list):
- importing the new/updated 'template' files
- using the `punchplatform-push-es-templates.sh` tool to upload the new templates into the cluster
- (sometimes) using `curl -XDELETE <es_host>:9200/_template/sometemplate` to purge an obsolete index template from the cluster
- then waiting for a new index to be created, so that the templates apply to it (existing indices are not impacted by an update of the index templates)
- (sometimes) changing and reloading the configuration of some punchlines so that they write to an index with a new name, so that the indexing configuration change is applied immediately.
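A hedged sketch of this sequence (the templates directory, the tool option and the host are assumptions; check `punchplatform-push-es-templates.sh --help` and your platform layout):

```sh
# Hedged sketch: paths, tool options and host are placeholders/assumptions.
# upload the new/updated templates into the cluster:
punchplatform-push-es-templates.sh --directory $PUNCHPLATFORM_CONF_DIR/resources/elasticsearch/templates

# (sometimes) remove an obsolete template:
curl -XDELETE 'localhost:9200/_template/sometemplate'

# check which templates are currently known to the cluster:
curl 'localhost:9200/_cat/templates?v'
```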
Stopping/(Re-)starting Elasticsearch cluster¶
A single Elasticsearch node can be stopped by the `sudo systemctl stop elasticsearch` command on the node.
This may cause a reduction of the data availability or cluster resilience. Elasticsearch may try to compensate by re-creating data replicas on another node.
For this reason, before any node stop with a planned duration in excess of a few minutes, or in case of a planned full cluster stop, it is better to manually alter the runtime configuration of the cluster to prevent the re-assignment of replicas.
This requires specific Elasticsearch API calls, both before and after the maintenance phase (to reset the settings to the normal operation mode once the cluster is up again).
The overall procedure is well described in the Elasticsearch official cluster restart procedure.
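In essence, the before/after calls are standard Elasticsearch cluster settings updates. A hedged sketch (replace `localhost:9200` with your Elasticsearch endpoint):

```sh
# before the maintenance: keep primaries allocated but stop replica re-allocation
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
# optionally flush, so that recovery is faster after the restart:
curl -XPOST 'localhost:9200/_flush'

# after the maintenance, once the nodes have rejoined: re-enable the normal allocation
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'
# and watch the cluster recover:
curl 'localhost:9200/_cluster/health?pretty'
```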
Important
When (re)starting an Elasticsearch cluster, the cluster will take minutes to load before the data is available, and may need minutes to hours before the data is normally replicated again.
This can be sped up significantly if the punchlines that normally write/index data into Elasticsearch, and the applications that query Elasticsearch (PML punchlines, correlation/alerting engine), are stopped before the cluster is stopped, and restarted only once the Elasticsearch cluster is at least in 'yellow' status again (i.e. the data is available for reading/writing).
Archiving storage operation¶
Checking/Sizing/viewing some archiving content¶
When documents are archived through the File Node + Elasticsearch storage of archive indexing data (see the Reference archiving punchline example), the archive content can be sized/viewed in the following ways:
- From a statistics/metrics point of view, by using a Kibana archiving dashboard, which will provide information about the stored content based on the indexed Archiving Metadata.
Extracting an archive files selection into kafka¶
For extracting archive documents into filesystem output files or to (re)index them into Elasticsearch, the process is the same:
- Select and extract archive files into kafka
- Use some punchline to write the documents into filesystem output or into Elasticsearch
The extraction process is described here. But because the content of the archive (format, backend, stored fields...) is platform-specific, no 'off the shelf' extraction punchline is delivered; it has to be designed and validated at integration time, when the archiving configuration is designed and validated, not later!
Key points
The extraction punchline is made of two parts
- An extraction input node, in charge of running a selection query against the Elasticsearch metadata index of the archive. This query can be designed by hand using Kibana, to determine how much data is targeted (by applying filters on time and topic).
- An archive reader node, in charge of opening each file identified by its metadata (read from Elasticsearch) and of emitting part or all of the read documents/fields.
  Note that it is possible to filter the documents output by the archive reader, if some pattern can be defined to select the interesting lines (such as an IP address or a domain name pattern).
Configuring an automated purge of obsolete archived data¶
Same as for Elasticsearch housekeeping, a standard Punchplatform Archive housekeeping service can be configured in each individual tenant to ensure removal of archived data past a given retention duration.
protection against mass destruction
Do not forget to set `max_deletion_percentage` to a reasonable value (say 2%) to allow the automatic (daily) purging of old data, without risking an automatic mass erasure in case of an incident causing the operating system time to shift to the future.
Stopping/(Re)-starting Zookeeper¶
Zookeeper nodes can be individually stopped using the `sudo systemctl stop zookeeper-<clusterId>` command on the node.
If half (or more) of the nodes are stopped, the cluster goes offline, which will stop Kafka and Storm (and therefore Shiva and most punchlines).
If the leader node is stopped, the cluster re-elects another leader. This may cause a few seconds to a few minutes of interruption for Kafka and the punchlines that depend on it (while all clients re-connect to the Zookeeper cluster and leader nodes are re-elected).
A clean stop of the whole cluster only implies stopping all the Zookeeper nodes.
They will restart automatically if the host VM/server (re)boots, or can be started manually using `systemctl`.
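A hedged sketch (`<clusterId>` and the client port are platform-specific, and the 'four-letter-word' admin commands may need to be whitelisted on recent Zookeeper releases):

```sh
# stop/start a single zookeeper node:
sudo systemctl stop zookeeper-<clusterId>
sudo systemctl start zookeeper-<clusterId>

# check whether a node is serving, and whether it is the leader or a follower
# (2181 is the default client port; adjust to your platform settings):
echo stat | nc localhost 2181 | grep Mode
```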
Stopping/(Re-)starting Shiva¶
Shiva management data is consistently persisted in Kafka. Shiva nodes can be stopped in any order using `sudo systemctl stop shiva-runner`.
Danger
When a shiva runner is stopped using `systemctl`, all the child processes (punchlines) running on the node under the supervision of this shiva node will be killed.
The cluster will take care of restarting the killed tasks elsewhere in the shiva cluster, if allowed by the placement constraints (shiva 'tags' on the task).
Stopping/(Re-)starting Storm¶
The Storm cluster persists its critical management data consistently in Zookeeper, except for the punchline/topology binaries.
This means that nimbus nodes should only be stopped when no channel/application start has been recently requested through `channelctl` (i.e. in the last few minutes).
This constraint excluded, all nodes can be individually or collectively stopped using `sudo systemctl stop storm-<module>`, where module is one of `nimbus`, `ui` or `supervisor`.
Stopping/(Re-)starting a whole Punchplatform¶
Stopping/starting is best done in dependency order, so as to reduce any risk of inconsistent critical data on the persistence filesystems.
The best order is therefore (see the sketch after the list):
- Stop channels (if time allows)
- Stop metricbeat daemons (`sudo systemctl stop metricbeat` on all nodes, through Ansible usage from the deployer node)
- Configure and stop Elasticsearch
- stop Shiva
- stop Storm
- stop Kafka
- stop Zookeeper
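A hedged sketch of this stop sequence (tenant name, Ansible inventory path and cluster ids are placeholders; the `systemctl` commands are to be run on each node of the relevant cluster, or driven through Ansible from the deployer node):

```sh
channelctl -t mytenant stop                                    # 1. stop the channels of each tenant
ansible all -i inventories/prod -b -m service \
  -a "name=metricbeat state=stopped"                           # 2. stop the metricbeat daemons everywhere
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
sudo systemctl stop elasticsearch                              # 3. on each Elasticsearch node
sudo systemctl stop shiva-runner                               # 4. on each Shiva node
sudo systemctl stop storm-nimbus storm-ui storm-supervisor     # 5. on each Storm node (installed modules only)
sudo systemctl stop kafka-<clusterId>                          # 6. on each Kafka broker
sudo systemctl stop zookeeper-<clusterId>                      # 7. on each Zookeeper node
```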
When restarting the platform, everything can go up at the same time, although, if time allows, the server restart order should be:
- start Elasticsearch servers. Wait for at least 'yellow' status (green is better but longer).
- (if a clean stop was done) reconfigure Elasticsearch to allow moving/creating of replicas
- start Zookeeper and Kafka servers
- start Storm and Shiva servers
- (if they were stopped) start the channels