
Troubleshooting Shiva leader election and task assignments
=============================================================

Principles

Shiva is a cluster made of "runner" nodes, one of which acts as the "leader". Tasks are submitted (start/stop) through channelctl, and the leader assigns them to a runner. Therefore communication is needed between the submitter and the runner nodes, and between the runner nodes themselves, through an internal data storage backend.

This page applies to a production setup where the Shiva internal data storage is Kafka (i.e. in the deployment settings, the 'storage.type' for the shiva cluster has been set to "kafka"):

[...]
  "shiva": {
    "clusters": {
      "front_shiva": {
        "zk_cluster": "zkf",
        "servers": {
          "pbrpfrshiv01": {
            "runner": true,
            "can_be_master": true,
            "tags": []
          },
          "pbrpfrshiv02": {
            "runner": true,
            "can_be_master": true,
            "tags": []
          }
        },
        "storage": {
          "type": "kafka",
          "kafka_cluster": "back"
        },
[...]

Shiva supports other storage types (e.g. a filesystem location, which can be used for single-node setups without Kafka).

Shiva stores its internal data as key/value pairs in a set of Kafka topics of the indicated kafka cluster (here 'back'), which is described in the kafka.clusters deployment setting.

The Kafka topics are the following:

  • CONTROL TOPIC: stores shiva runner heartbeat messages, and the start/stop commands for applications (written there by channelctl). The standard name for the CONTROL topic is: shiva-<clusterName>-ctl

  • ASSIGNMENT TOPIC: stores the assignments of tasks to runner nodes, as decided by the shiva cluster leader, which periodically writes updates of this information. The standard name for the ASSIGNMENT topic is: shiva-<clusterName>-assignment

  • DATA TOPIC: conveys details about the tasks. The standard name for the DATA topic is: shiva-<clusterName>-data

The actual cluster/topic names can be found either in the shiva section of the deployment settings, or in the shiva runner deployed configuration file (usually in /data/opt/punch-shiva*/conf/shiva.conf).
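
As a quick sanity check, you can list which shiva topics actually exist on the Kafka cluster. This is only a sketch: the kafka install path and broker address below are the examples used throughout this page, so adapt them to your platform:

/data/opt/kafka_2.11-2.4.0/bin/kafka-topics.sh --bootstrap-server pbrpmizkkaf01:9093 --list | grep shiva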

Which runner nodes are alive ? (Viewing runner node heartbeats)

All live runner nodes periodically publish a heartbeat message to the CONTROL TOPIC.

You can see which runners are alive through this command (wait a few seconds to see updates from live runners):

/data/opt/kafka_2.11-2.4.0/bin/kafka-console-consumer.sh --bootstrap-server pbrpmizkkaf01:9093 --topic shiva-processing_shiva-ctl  --property print.key=true --property print.timestamp=true --property key.separator="==> " | grep   worker

CreateTime:1598460879223==> hello_worker==> {"id":"pbrpmishiv01","tags":["pbrpmishiv01"]}
CreateTime:1598460813945==> hello_worker==> {"id":"pbrpmishiv02","tags":["pbrpmishiv02"]}

You can convert a timestamp easily using 'date' (remove the milliseconds from the timestamp):

date --date=@1598460813

Wed Aug 26 18:53:33 CEST 2020
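
If you prefer not to strip the milliseconds by hand, you can let the shell do the division (bash arithmetic, reusing the first CreateTime shown above):

date --date=@$((1598460879223 / 1000))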

Which runner is the current leader of a cluster ?

The leader node is the one that is consuming the control topic partition. The leader node uses a kafka consumer group named after the cluster ('<clusterName>-leader', e.g. 'processing_shiva-leader' in the example below).

You can find the current leader by using the '--describe' command of the kafka-consumer-groups.sh tool. A shortcut is to use its punchplatform wrapper:

punchplatform-kafka-consumers.sh --kafkaCluster back --describe --group processing_shiva-leader

bootstrap servers : 'pbrpmizkkaf01:9093 pbrpmizkkaf02:9093 pbrpmizkkaf03:9093'

kafka consumers for kafka cluster 'back'...

GROUP                   TOPIC                      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                                             HOST            CLIENT-ID
processing_shiva-leader shiva-processing_shiva-ctl 0          382             382             0               consumer-processing_shiva-leader-2-aae5ed84-c38a-4b33-958a-1c1f0f891a16 /20.20.16.71    consumer-processing_shiva-leader-2
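
If the punchplatform wrapper is not available on the node you are working from, the same information can be obtained with the raw Kafka tool (a sketch reusing the kafka install path, broker and consumer group from the examples above):

/data/opt/kafka_2.11-2.4.0/bin/kafka-consumer-groups.sh --bootstrap-server pbrpmizkkaf01:9093 --describe --group processing_shiva-leader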

What are the current assignments of tasks to runners ?

The leader node periodically (every few seconds) updates the assignments of tasks to applicable live runners, taking into account the required 'tags' of each task to choose runners that have those tags.

You can see the current assignments through this command (wait a few seconds to see the periodic updates written by the leader):

/data/opt/kafka_2.11-2.4.0/bin/kafka-console-consumer.sh --bootstrap-server pbrpmizkkaf01:9093 --topic shiva-processing_shiva-assignement | head -n 1 | jq .

{ "cluster_name": "processing_shiva", "election_timestamp": "2020-08-26T16:31:57.062Z", "assignements": { "pbrpmishiv02": [ "tenants/platform/channels/platform_monitoring/platform_health" ], "pbrpmishiv01": [ "tenants/platform/channels/platform_monitoring/local_events_dispatcher" ] }, "leader_id": "pbrpmishiv01", "state": { "workers": { "pbrpmishiv02": { "id": "pbrpmishiv02", "tags": [ "pbrpmishiv02" ] }, "pbrpmishiv01": { "id": "pbrpmishiv01", "tags": [ "pbrpmishiv01" ] } }, "applications": { "tenants/platform/channels/platform_monitoring/platform_health": { "name": "tenants/platform/channels/platform_monitoring/platform_health", "tags": [] }, "tenants/platform/channels/platform_monitoring/local_events_dispatcher": { "name": "tenants/platform/channels/platform_monitoring/local_events_dispatcher", "tags": [] } } }, "version": "5.0", "unassigned_tasks": [], "applications": { "tenants/platform/channels/platform_monitoring/platform_health": { "args": [ "platform-monitoring", "platform_health.json" ], "cluster_name": "processing_shiva", "name": "tenants/platform/channels/platform_monitoring/platform_health", "execution_schedule": "", "tags": [] }, "tenants/platform/channels/platform_monitoring/local_events_dispatcher": { "args": [ "punchline", "--mode", "light", "--punchline", "local_events_dispatcher.hjson" ], "cluster_name": "processing_shiva", "name": "tenants/platform/channels/platform_monitoring/local_events_dispatcher", "execution_schedule": "", "tags": [] } } }


You can get a more compact display through a jq filter:

 /data/opt/kafka_2.11-2.4.0/bin/kafka-console-consumer.sh --bootstrap-server pbrpmizkkaf01:9093 --topic shiva-processing_shiva-assignement | head -n 1  |jq -r '(.assignements | to_entries[] | .key as $HOST |  .value[] | ( . + " ==> " + $HOST) )'  

tenants/platform/channels/platform_monitoring/platform_health ==> pbrpmishiv02
tenants/platform/channels/platform_monitoring/local_events_dispatcher ==> pbrpmishiv01
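
A useful variant when troubleshooting is to check whether some tasks could not be assigned at all (for instance because no live runner carries their required tags); the assignment record shown above contains an 'unassigned_tasks' field for that purpose:

/data/opt/kafka_2.11-2.4.0/bin/kafka-console-consumer.sh --bootstrap-server pbrpmizkkaf01:9093 --topic shiva-processing_shiva-assignement | head -n 1 | jq .unassigned_tasks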

Where are the shiva logs ?

Runner daemon logs location

The runner/leader daemon logs are usually in /var/log/punch*/shiva/shiva-runner-daemon.log.
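
For example, to quickly look for recent errors in the daemon log of a runner node (path from above; adapt it if your platform uses a different log directory):

grep -iE 'error|exception' /var/log/punch*/shiva/shiva-runner-daemon.log | tail -n 20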

Task logs

A Shiva task/application can log events in two ways:

  1. Simply log messages to stdout/stderr.
  2. Log to a specified file using its own logic, for example using its own log4j configuration file.

Case 1: Output to stdout/stderr

The shiva runner daemon captures such logs, and sends them:

  • locally on the runner node, into /var/log/punch*/shiva/shiva-runner-subprocess.log
  • optionally into the configured 'reporter' associated with this shiva cluster.

This depends on your platform setup, but it is a good production practice to send these events (through a kafka reporter and an events dispatcher punchline) into a central Elasticsearch index (usually called platform-logs-*).
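
For a quick local check on the runner node itself (log path from above):

tail -f /var/log/punch*/shiva/shiva-runner-subprocess.log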

Check out the platform-logs-* index using your admin Kibana. You will see your task logs there.

Case 2: Writing to a file

You must analyze the task type and its specific arguments (technology, logger configuration...) to know where its local log files are located on the runner node. Shiva is not in charge of those log files.

How to manually run a submitted task to reproduce errors ?

When a task is submitted, the shiva runner loads the task command and arguments into a local folder on the assigned runner node. That folder is defined as one of the shiva startup arguments; by default it is the local operating system temporary folder. There you will find a folder tree obeying the tenant/channel hierarchy. For example, assuming you submitted the hello world task as explained in the Shiva guide:

.
└── punchplatform
    └── mytenant
        ├── channels
            └── hello_world_shiva
                └── my_hello_world_task
                    ├── CONTENT-1
                    └── hello_world.sh

To debug your command script, you can call it from this directory to see what happens.

To do so:

  • Connect to the shiva node that is supposed to run your task, and log in as the same user that runs the shiva daemon (this is the 'platform.punchplatform_daemons_user' from the deployment settings)
  • run a 'bash' shell
  • run source /data/opt/punch-shiva-<yourversion>/activate.sh in order to define the punch environment variables (path, conf...)
  • change directory to the temporary folder of your task: /tmp/shiva/punchplatform/<tenant>/channels/<channel>/<task>/
  • run your task command line (with the same command and args as indicated in the channel_structure file for the channel), as illustrated in the sketch below
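
For instance, with the hello world example above, a debug session could look like this (a sketch: the daemon user name and the exact install/temporary paths depend on your deployment settings):

# become the shiva daemon user (replace the placeholder with your 'punchplatform_daemons_user' value)
sudo -u <punchplatform_daemons_user> bash
# load the punch environment variables (path, conf...)
source /data/opt/punch-shiva-<yourversion>/activate.sh
# go to the task folder and run its command manually
cd /tmp/shiva/punchplatform/mytenant/channels/hello_world_shiva/my_hello_world_task
bash hello_world.sh   # or whatever command and args the channel_structure file specifies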