Sparkline

Sparkline CRD instances are managed by the Sparkline Operator.

Sparkline instances are designed to address batch pipeline use cases.

Note

Although it is possible to use Spark Structured Streaming within a punchline, we do not consider such a use case to be production-ready.

Sparkline Operator lifecycle

Note

Only the core reconciliation loop is described below; this is not the complete lifecycle of a Sparkline instance.

A Sparkline instance can go through five different phases, similar to Kubernetes Pod phases: Pending, Running, Succeeded, Failed and Unknown.

  • When an instance of the Sparkline CRD is submitted to the Kubernetes API server, its status is initially empty.
  • The Sparkline Operator catches the creation event and updates the instance status to Pending.
  • During the Pending phase, the required Kubernetes sub-resources (pods and configmaps) are created by the Sparkline Operator.
  • OwnerReferences pointing to the Sparkline instance are set on all created sub-resources.
  • Finalizers are set on the Sparkline instance so that the driver pod dynamically created by the spark-submit pod can be garbage-collected.
  • Once all sub-resources are created successfully, the instance status is updated to Running.
  • While in the Running phase, the Sparkline Operator aggregates the statuses of all sub-resources owned by the instance and updates the instance status based on the aggregated result (see the sketch below).
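
The current phase of an instance can be checked with kubectl. A minimal sketch, assuming the CRD resource is exposed as sparkline and the phase is published at .status.phase (both are assumptions, not confirmed by this page):

# watch the operator move the instance through its phases
# (the 'sparkline' resource name is an assumption)
kubectl get sparkline sparkline-sample -w

# print only the current phase (.status.phase path is an assumption)
kubectl get sparkline sparkline-sample -o jsonpath='{.status.phase}{"\n"}'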

Mutating/Validating webhooks

Using webhooks with Sparkline instances

Any Sparkline CRD instance defining the .metadata.annotations.platform.gitlab.thalesdigital.io/platform: <PLATFORM_CRD_INSTANCE_NAME> annotation will have its fields mutated and validated based on the <PLATFORM_CRD_INSTANCE_NAME> resource.

apiVersion: punchline.gitlab.thalesdigital.io/v1
kind: Sparkline
metadata:
  name: sparkline-sample
  annotations:
    platform.gitlab.thalesdigital.io/platform: "<PLATFORM_CRD_INSTANCE_NAME>"
...

Using this annotation is mandatory whenever sensitive objects, e.g. Secrets, must be used.
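
The annotation can also be added to an existing instance with kubectl. A minimal sketch, assuming the CRD resource is exposed as sparkline and an instance named sparkline-sample:

# attach the instance to a platform so the webhooks can mutate and validate it
# (the 'sparkline' resource name is an assumption)
kubectl annotate sparkline sparkline-sample \
  platform.gitlab.thalesdigital.io/platform=<PLATFORM_CRD_INSTANCE_NAME>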

Configuration

Native Kubernetes fields

Fields such as:

  • .apiVersion
  • .kind
  • .metadata

...are standard fields shared by all Kubernetes resources.

apiVersion: punchline.gitlab.thalesdigital.io/v1
kind: Sparkline
metadata:
  name: sparkline-java-sample
...

The .metadata field is propagated to all sub-resources of the Sparkline instance, as illustrated below.
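
For example, labels set on the instance should end up on the pods and configmaps it owns. A minimal sketch; the team label is purely illustrative:

apiVersion: punchline.gitlab.thalesdigital.io/v1
kind: Sparkline
metadata:
  name: sparkline-java-sample
  labels:
    # illustrative label, propagated to all sub-resources,
    # so they can be selected with: kubectl get pods -l team=data-eng
    team: data-eng
...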

Customizing an instance based on .spec field

spec:
  # can be java or python
  # java: spark runtime
  # python: pyspark runtime
  implementation: java
  # define an image name
  # should be one of our published sparkline image tags
  # see: ghcr.io/punchplatform
  image: sparkline:7.0.1
  # In general, this field is taken care of by our webhooks
  # define a ServiceAccount in case additional RBAC rules or imagePullSecrets are needed at runtime
  serviceAccount: admin-user
  # In general, this field is taken care of by our webhooks
  # can be any init container image as long as it follows our operator-defined interface
  # we provide one in our private repository:
  # ghcr.io/punchplatform
  initContainerImage: resourcectl:7.0.1
  imagePullPolicy: IfNotPresent
  # setting this to true will result in the submitted instance being garbage-collected upon reaching the Succeeded phase
  # to be used only when oneshot: true
  garbageCollect: false
  # In general, this field is taken care of by our webhooks
  # This field lets you mount Secret resources living in the same namespace as the sparkline instance
  # so that your program can consume them for various purposes: e.g. fetching data from an Elasticsearch cluster.
  secretRefs:
    - name: "resourcectl-tls"
      mountPath: "/var/run/kubernetes/platform/secrets/resourcectl/resourcectl-tls"
  # define the list of dependencies this sparkline requires
  dependencies:
    - punch-parsers:org.thales.punch:punch-websense-parsers:1.0.0
    - punch-parsers:org.thales.punch:common-punchlets:4.0.2
    - file:org.thales.punch:geoip-resources:1.0.1
  # dag definition of a sparkline punchline
  punchline:
    dag:
      - settings:
          input_data:
            - date: "{{ from }}"
              name: from_date
            - date: "{{ to }}"
              name: to_date
        component: input
        publish:
          - stream: data
        type: dataset_generator
      - settings:
          truncate: false
        component: show
        subscribe:
          - component: input
            stream: data
        type: show
  # key (string) / value (string) pairs
  # of known sparkline settings
  settings:
    spark.kubernetes.container.image.pullPolicy: IfNotPresent
  # define additional files to be mounted on the container filesystem at runtime
  # key: file_name
  # value: file_content
  configs:
    # this will create a file 'myCustomConfMountedOnPod'
    # with content: <value of the key>
    myCustomConfMountedOnPod: |
      # this content will be mounted on
      # the pod container local filesystem at
      # /data/myCustomConfMountedOnPod
      test: hello world
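
Once a manifest is written, an instance is submitted and removed through the usual kubectl workflow. A minimal sketch; the file name is illustrative and the sparkline resource name is an assumption:

# submit the instance; the operator takes it from Pending to Running
kubectl apply -f sparkline-sample.yaml

# follow the instance while the operator reconciles it
kubectl get sparkline sparkline-sample -w

# deleting the instance triggers the finalizers described above
kubectl delete sparkline sparkline-sample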

Example(s)

Java

Standard / Dataset Generator to Stdout

---
apiVersion: punchline.gitlab.thalesdigital.io/v1
kind: Sparkline
metadata:
  name: java-sample
spec:
  image: ghcr.io/punchplatform/sparkline:7.0.1-SNAPSHOT
  imagePullPolicy: IfNotPresent
  serviceAccount: admin-user
  garbageCollect: false
  implementation: java
  settings:
    spark.executor.instances: "1"
    spark.kubernetes.authenticate.driver.serviceAccountName: admin-user
  punchline:
    dag:
    - settings:
        input_data:
        - date: "{{ from }}"
          name: from_date
        - date: "{{ to }}"
          name: to_date
      component: input
      publish:
      - stream: data
      type: dataset_generator
    - settings:
        truncate: false
      component: show
      subscribe:
      - component: input
        stream: data
      type: show
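
The show node prints the generated dataset on the driver pod's stdout. A minimal sketch for inspecting it; the driver pod name is generated at runtime, so look it up first:

# locate the driver pod spawned for the instance
kubectl get pods

# tail its stdout, where the 'show' node writes the dataset
kubectl logs -f <driver-pod-name>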

Python

Standard / Dataset Generator to Stdout

---
apiVersion: punchline.gitlab.thalesdigital.io/v1
kind: Sparkline
metadata:
  name: python-sample
spec:
  image: ghcr.io/punchplatform/sparkline:7.0.1-SNAPSHOT
  imagePullPolicy: IfNotPresent
  serviceAccount: admin-user
  garbageCollect: false
  implementation: python
  settings:
    spark.executor.instances: "1"
    spark.kubernetes.authenticate.driver.serviceAccountName: admin-user
  punchline:
    dag:
    - settings:
        input_data:
        - date: "{{ from }}"
          name: from_date
        - date: "{{ to }}"
          name: to_date
      component: input
      publish:
      - stream: data
      type: dataset_generator
    - settings:
        truncate: false
      component: show
      subscribe:
      - component: input
        stream: data
      type: show