Metrics/logs in Apache Spark

Description

The execution of a Spark application (punchline) is broken down into events of the following types:

  • spark.application.start
  • spark.application.end
  • spark.application.failed
  • spark.job.start
  • spark.job.end
  • spark.stage.start
  • spark.stage.end
  • spark.task.start
  • spark.task.end

These events are published at the level of a Spark/PySpark punchline. To send this information to a database, you have to define one of our reporters.

Events are published as JSON by default and should be sent to a Kafka topic. If you want a different output format, for instance CSV, you can use a punchline that reads from the Kafka topic and transforms each event into the desired format.
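The transformation step itself can be sketched in plain Python. This is only an illustration of flattening a JSON event into CSV columns, not the punchline you would actually deploy: reading from the Kafka topic is left out, and the helper names flatten and events_to_csv are hypothetical.

```python
import csv
import io
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def events_to_csv(json_lines, columns):
    """Convert JSON-encoded events into CSV text restricted to `columns`.

    Fields missing from an event are written as empty cells; fields not
    listed in `columns` are ignored.
    """
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for line in json_lines:
        writer.writerow(flatten(json.loads(line)))
    return out.getvalue()
```

Calling events_to_csv(lines, ["message", "labels.id.job", "labels.event.status"]) would produce one header row followed by one row per event.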

Finally, these events are published using the ECS 1.5 format. They are designed for building user-friendly dashboards that filter data at several levels of granularity, while ensuring that no visualization breaks while users investigate their data. This is why some fields are unavoidably redundant from one event to another.

Events definition

Note 1: a Spark application is divided into multiple jobs

Note 2: a Spark job is divided into multiple stages

Note 3: a Spark stage is divided into multiple tasks

Note 4: data locality for a given stage, and hence for its set of tasks, should be on the same server

Note 5: multiple jobs are not executed in parallel, but the stages of a given job are!

Event: Definition
spark.application.start: This event always happens and is published when a punchline using a Spark or PySpark runtime is executed.
spark.application.end: This event always happens and is published when a punchline using a Spark or PySpark runtime is about to exit (exit code 0 or 1).
spark.application.failed: This event is published when an exception occurs at application level for a punchline using a Spark or PySpark runtime.
spark.job.start: This event is published when a job is about to start for a given punchline using a Spark or PySpark runtime.
spark.job.end: This event is published when a job is about to end for a given punchline using a Spark or PySpark runtime.
spark.stage.start: This event is published when a stage of a given job is about to start for a given punchline using a Spark or PySpark runtime.
spark.stage.end: This event is published when a stage of a given job is about to end for a given punchline using a Spark or PySpark runtime.
spark.task.start: This event is published when a task of a given stage is about to start for a given punchline using a Spark or PySpark runtime.
spark.task.end: This event is published when a task of a given stage is about to end for a given punchline using a Spark or PySpark runtime.
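Since most levels publish both a start and an end event, a consumer can pair them to measure how long a job, stage or task took. The sketch below is one possible way to do it, under several assumptions: the event name is carried in the message field and the identifier and timestamp in labels.event.id and labels.event.timestamp (as in the examples further down), events are supplied in chronological order, and the function name elapsed_ms is hypothetical.

```python
from datetime import datetime

# Timestamp layout assumed from the event examples, e.g. 2020-09-03T07:48:47.827Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_ms(events):
    """Return {event id: milliseconds between its start and end events}.

    `events` is an iterable of decoded JSON events in chronological order.
    Events whose phase is neither "start" nor "end" (for instance
    spark.application.failed) are ignored.
    """
    started = {}
    elapsed = {}
    for event in events:
        # "spark.job.end" -> phase "end"
        phase = event["message"].split(".")[-1]
        info = event["labels"]["event"]
        stamp = datetime.strptime(info["timestamp"], TIMESTAMP_FORMAT)
        if phase == "start":
            started[info["id"]] = stamp
        elif phase == "end" and info["id"] in started:
            delta = stamp - started[info["id"]]
            elapsed[info["id"]] = round(delta.total_seconds() * 1000)
    return elapsed
```

An end event with no matching start is skipped rather than reported, which keeps the result limited to intervals that were fully observed.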

Event examples

Note 1: Because of how Spark works internally, labels.application.progression is calculated per job. An application may be composed of several jobs.
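Because the progression is relative to a job rather than to the whole application, a dashboard or script would typically track it per job identifier. A minimal sketch, assuming the field paths shown in the examples below (labels.id.job, labels.application.progression, @timestamp); the function name is hypothetical:

```python
def latest_progression_per_job(events):
    """Return {job id: most recent progression value seen for that job}.

    Events without a job id (application-level events) or without a
    progression value are skipped.
    """
    latest = {}
    # ISO 8601 timestamps in a single fixed layout sort correctly as strings.
    for event in sorted(events, key=lambda e: e["@timestamp"]):
        labels = event["labels"]
        job_id = labels["id"].get("job")
        progression = labels["application"].get("progression")
        if job_id is not None and progression is not None:
            latest[job_id] = progression
    return latest
```

An application-wide figure would have to be derived from these per-job values; how to combine them is left open here.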

Application End

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost",
        "progression": 100
      },
      "id": {
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "application",
        "id": "local-1599119321655",
        "failure_reason": "None"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.application.end",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}

Job End

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 2000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost",
        "progression": 100
      },
      "id": {
        "job": "local-1599119321655_3",
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "job",
        "parent": "local-1599119321655",
        "id": "local-1599119321655_3",
        "status": "JobSucceeded",
        "failure_reason": "None"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.job.end",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}

Stage End

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost",
        "progression": 50
      },
      "id": {
        "stage": "local-1599119321655_3_3",
        "job": "local-1599119321655_3",
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "stage",
        "parent": "local-1599119321655_3",
        "id": "local-1599119321655_3_3",
        "status": "succeeded",
        "num_tasks": 11,
        "failure_reason": "None"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.stage.end",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}

Task End

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost",
        "progression": 50
      },
      "id": {
        "task": "local-1599119321655_3_3_29",
        "stage": "local-1599119321655_3_3",
        "job": "local-1599119321655_3",
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "task",
        "parent": "local-1599119321655_3_3",
        "id": "local-1599119321655_3_3_29",
        "status": "SUCCESS",
        "failure_reason": "None",
        "duration": 60,
        "cpu_time": 10000,
        "disk_spilled_time": 0,
        "attempt": 0,
        "gc_time": 0,
        "host": "localhost",
        "index": 8,
        "mem_spilled_time": 0,
        "num": 8,
        "reason": "Success",
        "size": 1723,
        "action": "ResultTask"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.task.end",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}
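The job, stage and task examples above each carry a labels.event.parent field pointing one level up, which makes it possible to rebuild the application/job/stage/task hierarchy from a stream of events. The following is a sketch under that assumption; the function name build_children_index is hypothetical.

```python
from collections import defaultdict


def build_children_index(events):
    """Map each parent id to the sorted ids of its direct children.

    Application-level events carry no parent field and therefore only
    appear as keys (parents of jobs), never as children.
    """
    children = defaultdict(set)
    for event in events:
        info = event["labels"]["event"]
        parent = info.get("parent")
        if parent is not None:
            children[parent].add(info["id"])
    return {parent: sorted(ids) for parent, ids in children.items()}
```

Using a set per parent means that the start and end events of the same task count once.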

Task Start

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost",
        "progression": 50
      },
      "id": {
        "task": "local-1599119321655_3_3_29",
        "stage": "local-1599119321655_3_3",
        "job": "local-1599119321655_3",
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "task",
        "parent": "local-1599119321655_3_3",
        "id": "local-1599119321655_3_3_29",
        "status": "SUCCESS",
        "failure_reason": "None",
        "attempt": 0,
        "executor_id": "driver",
        "host": "localhost",
        "index": 8,
        "mem_spilled_time": 0,
        "num": 8,
        "size": 1723
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.task.start",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}

Stage Start

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost"
      },
      "id": {
        "stage": "local-1599119321655_3_3",
        "job": "local-1599119321655_3",
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "stage",
        "parent": "local-1599119321655_3",
        "id": "local-1599119321655_3_3",
        "status": "SUCCESS",
        "failure_reason": "None"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.stage.start",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}

Application Start

{
    "labels": {
      "level": "info",
      "runtime": "pyspark",
      "duration": {
        "application_ms": 5000
      },
      "application": {
        "deploy_mode": "client",
        "driver_host": "localhost"
      },
      "id": {
        "application": "local-1599119321655"
      },
      "event": {
        "timestamp": "2020-09-03T07:48:47.827Z",
        "type": "application",
        "id": "local-1599119321655",
        "failure_reason": "None"
      }
    },
    "platform": {
      "tenant": "mytenant",
      "id": "standalone",
      "channel": "default",
      "version": "6.0",
      "application": {
          "name": "punchline.template",
          "id": "mytenant_default_punchline.template"
      }
    },
    "@timestamp": "2020-09-03T07:48:47.827Z",
    "message": "spark.application.start",
    "tags": [],
    "host": {
      "hostname": "PUNCH",
      "os": "Linux",
      "domain": "PUNCH",
      "ip": "127.0.1.1",
      "name": "PUNCH",
      "user": "jonathan",
      "architecture": "amd64"
    },
    "ecs": {
      "version": "1.5.0"
    }
}