Plan-sparkline helm (WIP)

Abstract

Plan enables you to schedule your spark/pyspark punchlines periodically. The periodicity is set with a cron expression.
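
For instance, the following settings snippet (illustrative value) would schedule a run every five minutes; the plan.yaml example below uses */1 * * * *, i.e. one run per minute:

settings:
  cron: "*/5 * * * *"   # fields: minute hour day-of-month month day-of-week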

Warning

This section describes the punch helm plan-sparkline usage for debugging purposes only. Its administration (start/status/stop) is wrapped in punch commands, so you do not need to use it directly.

Deploy plan-sparkline

TODO

Configuration

TODO

Runtime behavior on plan submission

When a plan is submitted to a Kubernetes cluster, a Job resource is created. Given a cron expression, a punchline is scheduled at regular intervals. In our scenario, a spark/pyspark punchline is submitted and its status is monitored by the plan application.
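
As a rough sketch, the resource created on submission could resemble the following Kubernetes Job manifest. This is a minimal illustration, not the exact output of the plan operator: the name, namespace and image are assumptions.

---
apiVersion: batch/v1
kind: Job
metadata:
  name: plansparkoperator   # assumption: derived from the plan name
  namespace: mytenant       # assumption: one namespace per tenant
spec:
  template:
    spec:
      containers:
        - name: plan
          image: punch/plan   # illustrative image reference
          args: ["start", "--template", "template.yaml", "--plan", "plan.yaml"]
      restartPolicy: Never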

Checkpointing

It is possible to define a checkpoint backend for our plan application. Currently, only Elasticsearch is supported.

This enables the plan to restart from the last successfully scheduled date in case of failure.
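
As an illustration only, such a checkpoint backend could be declared in plan.yaml along the following lines. The checkpoint section and its field names (type, hosts, index) are assumptions, not the documented schema:

settings:
  cron: "*/1 * * * *"
  checkpoint:                  # hypothetical section, field names are assumptions
    type: elasticsearch
    hosts: ["localhost:9200"]  # illustrative Elasticsearch endpoint
    index: plan-checkpoints    # illustrative index storing the last successful date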

Configuration

Two files are needed to use our plan application:

  • plan.yaml
  • template.yaml

plan.yaml

In a plan configuration file, you can define a set of variables, for instance dates, which can be accessed in the template file by using the notation {{ myVar }}.

---
tenant: mytenant
version: '6.0'
name: plansparkoperator
channel: plansparkoperator
model:
  dates:
    from:
      offset: "-PT1m"
      format: yyyy-MM-dd'T'HH:mmZ
    to:
      format: yyyy-MM-dd'T'HH:mmZ
settings:
  cron: "*/1 * * * *"
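
To make the model above concrete, here is how from and to could resolve for a single tick, assuming the plan fires at 2021-01-01T10:05+0000 and that the offset -PT1m is interpreted as one minute before the scheduled date:

# Illustrative resolution for a tick at 2021-01-01T10:05+0000:
# from -> "2021-01-01T10:04+0000"   # scheduled date shifted by -PT1m
# to   -> "2021-01-01T10:05+0000"   # scheduled date, no offset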

template.yaml

The scheduled date can be retrieved within the template file with the help of templating. The templating engine used is the well-known Jinja2.

---
type: punchline
runtime: pyspark
version: '6.0'
dag:
- settings:
    input_data:
    - date: "{{ from }}"
      name: from_date
    - date: "{{ to }}"
      name: to_date
  component: input
  publish:
  - stream: data
  type: dataset_generator
- settings:
    truncate: false
  component: show
  subscribe:
  - component: input
    stream: data
  type: show
settings:
  spark.executor.instances: 1
  spark.kubernetes.executor.limit.cores: 0.5
  spark.kubernetes.executor.request.cores: 0.5
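
Continuing the illustrative tick used above, after Jinja2 rendering the input_data settings of the submitted punchline would read:

    input_data:
    - date: "2021-01-01T10:04+0000"
      name: from_date
    - date: "2021-01-01T10:05+0000"
      name: to_date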

channel_structure.yaml

Warning

The runtime in which your punchline will be executed is dictated by the runtime parameter of your template file. In our case: pyspark.

Note

The paths of template.yaml and plan.yaml are relative to channel_structure.yaml, i.e. $PUNCHPLATFORM_CONF_DIR/tenants/mytenant/mychannel.
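
With the files of this example, the channel layout would thus be (illustrative):

$PUNCHPLATFORM_CONF_DIR/tenants/mytenant/mychannel/
├── channel_structure.yaml
├── plan.yaml
└── template.yaml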

---
version: '6.0'
applications:
  - name: plansparkoperator
    runtime: kubernetes
    cluster: common
    kind: plan
    args: [
      "start",
      "--template", "template.yaml",
      "--plan", "plan.yaml"
    ]